I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand what transformation I need in order to pass these values to the distributed matrix.
EDIT:
Maybe I have been able to solve it:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.to
toArray toDenseMatrix toDenseVector toScalaVector toString
toVector
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.Dense
DenseMatrix DenseVector
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0)
// put the column values in a list twice (one entry per column), transpose
// to get one list per row, and make a Vector from each row
val t = List(col, col).transpose.map(r => Vectors.dense(r.toArray))
// make an RDD from the resulting sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t))
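If you start from the flat, column-major array in the question (ss) rather than from separate column lists, the same transpose idea applies. Here is a minimal sketch in plain Scala, no Spark needed; the names flat, numRows and rows are just for illustration. Each resulting row is what you would then wrap with Vectors.dense, as in the answer above.

```scala
// The column values from the question, repeated once per column (column-major layout).
val col = Array(-3.0, -1.5, 0.0, 1.5, 3.0)
val flat: Array[Double] = col ++ col

// Number of rows in the target matrix (5 in the question).
val numRows = 5

// Cut the flat array into columns of length numRows, then transpose
// so that each inner list is one row of the matrix.
val rows: List[List[Double]] =
  flat.grouped(numRows).map(_.toList).toList.transpose
```

Each element of rows now holds one matrix row (two values), which is the shape a RowMatrix expects once the rows are turned into Vectors and parallelized.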
Related
Spire (Scala) is giving the following result. As I understand it, it should give List((0.0, 0.1], [3.0, 5.0)). Why do I get this result instead?
scala> val x = Interval.openLower(0.0,0.1)
x: spire.math.Interval[Double] = (0.0, 0.1]
scala> val y = Interval.openUpper(3.0,5.0)
y: spire.math.Interval[Double] = [3.0, 5.0)
scala> x.union(y)
res0: spire.math.Interval[Double] = (0.0, 5.0)
And also
val S = Interval.open(1.0, 4.5)
val A = Interval.open(1.0, 3.0)
val B = Interval.open(2.0, 4.0)
val C = Interval.openUpper(3.0, 4.5)
println(S \ (A ∩ B))
val list = (S \ A).union(S \ B)
println(list)
The result is
List((1.0, 2.0], [3.0, 4.5))
List([3.0, 4.5), (1.0, 2.0], [4.0, 4.5))
How can I normalize the second result so that it equals the first?
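I ran into the same issue and found that Spire's IntervalSeq gets the job done. Note that ^ on IntervalSeq is symmetric difference (XOR); it coincides with the set difference S \ X here only because A and B are both subsets of S.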
// ammonite script intervals.sc
import $ivy.`org.typelevel::spire:0.17.0-M1`
import $ivy.`org.typelevel::spire-extras:0.17.0-M1`
import spire.math.Interval
import spire.math.extras.interval.IntervalSeq
import spire.implicits._
val S = IntervalSeq(Interval.open(1.0, 4.5))
val A = IntervalSeq(Interval.open(1.0, 3.0))
val B = IntervalSeq(Interval.open(2.0, 4.0))
val C = IntervalSeq(Interval.openUpper(3.0, 4.5))
val r1 = (S ^ (A & B))
println("r1=>" + r1.intervals.toList)
val r2 = ((S ^ A) | (S ^ B))
println("r2=>" + r2.intervals.toList)
Running this using the Ammonite REPL results in the following output:
r1=>List((1.0, 2.0], [3.0, 4.5))
r2=>List((1.0, 2.0], [3.0, 4.5))
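To see why the plain list from the question differs, note that it is simply not normalized: [3.0, 4.5) already covers [4.0, 4.5), and the entries are unsorted. Below is a hand-rolled sketch of the normalization step in plain Scala. The Iv type and the functions touches, extend and normalize are hypothetical names for illustration and are not part of Spire; the sketch also assumes every interval is non-empty.

```scala
// Hypothetical minimal interval type for illustration; this is NOT Spire's API.
final case class Iv(lo: Double, loClosed: Boolean, hi: Double, hiClosed: Boolean)

// True if `next` overlaps or directly continues `cur` (assumes cur starts first).
def touches(cur: Iv, next: Iv): Boolean =
  next.lo < cur.hi || (next.lo == cur.hi && (next.loClosed || cur.hiClosed))

// Extend `cur` so that it also covers `next` (assumes touches(cur, next)).
def extend(cur: Iv, next: Iv): Iv =
  if (next.hi > cur.hi) cur.copy(hi = next.hi, hiClosed = next.hiClosed)
  else if (next.hi == cur.hi) cur.copy(hiClosed = cur.hiClosed || next.hiClosed)
  else cur

// Sort by lower bound (closed before open on ties), then merge neighbours.
def normalize(ivs: List[Iv]): List[Iv] = {
  val sorted = ivs.sortWith { (a, b) =>
    a.lo < b.lo || (a.lo == b.lo && a.loClosed && !b.loClosed)
  }
  sorted.foldLeft(List.empty[Iv]) {
    case (cur :: done, next) if touches(cur, next) => extend(cur, next) :: done
    case (acc, next)                               => next :: acc
  }.reverse
}

// The second list from the question: [3.0, 4.5), (1.0, 2.0], [4.0, 4.5)
val unnormalized = List(
  Iv(3.0, true, 4.5, false),
  Iv(1.0, false, 2.0, true),
  Iv(4.0, true, 4.5, false))

val normalized = normalize(unnormalized)
```

After normalization the list holds (1.0, 2.0] followed by [3.0, 4.5), which is the first result from the question. In practice IntervalSeq does this bookkeeping for you.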
I have a vector of type scala.collection.immutable.Vector and would like to convert it to a vector of type org.apache.spark.ml.linalg.Vector.
For example, I want something like the following:
import org.apache.spark.ml.linalg.Vectors
val scalaVec = Vector(1,2,3)
val sparkVec = Vectors.dense(scalaVec)
Note that I could simply type val sparkVec = Vectors.dense(1,2,3) but I want to convert existing scala collection Vectors. I want to do this to embed these DenseVectors in a DataFrame to feed into spark.ml pipelines.
Vectors.dense can take an array of doubles. What is likely causing your trouble is that Vectors.dense won't accept Ints, which is what scalaVec holds in your example. So the following fails:
import org.apache.spark.ml.linalg.Vectors
val test = Seq(1,2,3,4,5).to[scala.Vector].toArray
Vectors.dense(test)
import org.apache.spark.ml.linalg.Vectors
test: Array[Int] = Array(1, 2, 3, 4, 5)
<console>:67: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.ml.linalg.Vector cannot be applied to (Array[Int])
Vectors.dense(test)
While this works:
val testDouble = Seq(1,2,3,4,5).map(x=>x.toDouble).to[scala.Vector].toArray
Vectors.dense(testDouble)
testDouble: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
res11: org.apache.spark.ml.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]
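You can also pass the vector elements as var-args. Because the signature is dense(firstValue: Double, otherValues: Double*), you have to convert the elements to Double and split off the head:
val scalaVec = Vector(1, 2, 3)
val doubles = scalaVec.map(_.toDouble)
val sparkVec = Vectors.dense(doubles.head, doubles.tail: _*)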
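If you need this conversion for more than one element type, it can be wrapped in a small helper. This is a minimal sketch in plain Scala; toDoubleArray is a hypothetical helper name, not something Spark provides. The resulting Array[Double] is what the Array[Double] overload of Vectors.dense shown in the error message above expects.

```scala
// Hypothetical helper: convert any numeric Scala sequence to Array[Double].
def toDoubleArray[A](xs: Seq[A])(implicit num: Numeric[A]): Array[Double] =
  xs.map(num.toDouble).toArray

val scalaVec = Vector(1, 2, 3)

// Works for Int, Long, Float, ... because of the Numeric type class.
val doubles: Array[Double] = toDoubleArray(scalaVec)
val fromLongs: Array[Double] = toDoubleArray(List(10L, 20L))
```

You would then call Vectors.dense(doubles) to get the Spark vector.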
I have a 2D list of integers and I would like to convert it to either an RDD[Vector] or a JavaRDD[Vector] in order to use the predict method of the SVM model in Spark MLlib.
I have tried the following to convert it to an RDD, but it seems that this is not what I need.
val tuppleSlides = encoded.iterator.sliding(10).toList
val rdd = sc.parallelize(tuppleSlides)
Any ideas on how to convert it to the right type?
Thank you in advance.
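If you want to train a model with MLlib you will need an RDD[LabeledPoint]. Given your 2D list of data and some list of labels, you can create your RDD[LabeledPoint] as shown below. (The transcript uses the org.apache.spark.ml classes; for the RDD-based MLlib SVM, import Vectors from org.apache.spark.mllib.linalg and LabeledPoint from org.apache.spark.mllib.regression instead. The construction is the same.)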
scala> val labels = List(1.0, -1.0)
labels: List[Double] = List(1.0, -1.0)
scala> val myData = List(List(1d,2d), List(3d,4d))
myData: List[List[Double]] = List(List(1.0, 2.0), List(3.0, 4.0))
scala> import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.feature.LabeledPoint
scala> val vectors = myData.map(x => Vectors.dense(x.toArray))
vectors: List[org.apache.spark.ml.linalg.Vector] = List([1.0,2.0], [3.0,4.0])
scala> val labPts = labels.zip(vectors).map{case (l, fV) => LabeledPoint(l, fV)}
labPts: List[org.apache.spark.ml.feature.LabeledPoint] = List((1.0,[1.0,2.0]), (-1.0,[3.0,4.0]))
scala> val myRDD = sc.parallelize(labPts)
myRDD: org.apache.spark.rdd.RDD[org.apache.spark.ml.feature.LabeledPoint] = ParallelCollectionRDD[0] at parallelize at <console>:34
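For prediction alone no labels are needed: as far as I know, the RDD-based SVMModel.predict takes an RDD[Vector], so the remaining step is to turn each sliding window of Ints into doubles. Below is a minimal sketch of that step in plain Scala; the encoded values and the window size of 3 are made up for illustration (the question uses 10).

```scala
// Hypothetical stand-in for the asker's `encoded` data.
val encoded: List[Int] = List(1, 2, 3, 4, 5)
val windowSize = 3 // the question uses 10

// One Array[Double] per sliding window; Ints must become Doubles
// before they can be turned into MLlib vectors.
val windows: List[Array[Double]] =
  encoded.sliding(windowSize).map(_.map(_.toDouble).toArray).toList

// With Spark on the classpath, each window would then be wrapped with
// Vectors.dense(window) and the resulting list passed to sc.parallelize.
```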
I tried to use (1,), but it doesn't work. What is the syntax to define a Tuple1 in Scala?
scala> val a=(1,)
<console>:1: error: illegal start of simple expression
val a=(1,)
For tuples with cardinality 2 or more you can use parentheses; for cardinality 1, however, you need to use Tuple1 explicitly:
scala> val tuple1 = Tuple1(1)
tuple1: (Int,) = (1,)
scala> val tuple2 = ('a', 1)
tuple2: (Char, Int) = (a,1)
scala> val tuple3 = ('a', 1, "name")
tuple3: (Char, Int, java.lang.String) = (a,1,name)
scala> tuple1._1
res0: Int = 1
scala> tuple2._2
res1: Int = 1
scala> tuple3._1
res2: Char = a
scala> tuple3._3
res3: String = name
To declare the type, use Tuple1[T], for example val t : Tuple1[Int] = Tuple1(22)
A tuple is, by definition, an ordered list of elements. While Tuple1 exists, I haven't seen it used explicitly, given that you'd normally just use the single element itself. Nevertheless, there is no syntactic sugar for it: you need to write Tuple1(1).
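To make the difference concrete, here is a short sketch in plain Scala contrasting a parenthesized value with an explicit Tuple1; the value names are only for illustration.

```scala
// Parentheses around a single expression only group it: this is a plain Int.
val notATuple: Int = (1)

// A one-element tuple has to be built explicitly.
val single: Tuple1[Int] = Tuple1(1)

// Access works as for any other tuple.
val viaAccessor: Int = single._1

// Pattern matching on Tuple1 works too.
val viaPattern: Int = single match {
  case Tuple1(v) => v
}
```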
There is a valid use case in Spark that requires Tuple1: create a dataframe with one column.
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
data.toDF("features").show()
It will throw an error:
"value toDF is not a member of Seq[org.apache.spark.ml.linalg.Vector]"
To make it work, we have to convert each row to Tuple1:
val data = Seq(
Tuple1(Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))),
Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
)
Or, more concisely:
val data = Seq(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
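The wrapping step itself has nothing Spark-specific in it. Here is a minimal sketch in plain Scala that uses strings in place of the ML vectors; features and rows are illustrative names. My understanding is that toDF wants each element to be a tuple or case class (a Product), which is why a bare Seq[Vector] is rejected.

```scala
// Stand-in values; in the Spark example these would be ML vectors.
val features: Seq[String] = Seq("a", "b", "c")

// Wrap every element in a one-element tuple, giving one "row" per element.
val rows: Seq[Tuple1[String]] = features.map(Tuple1.apply)

// Unwrapping again is just a matter of reading the first field.
val unwrapped: Seq[String] = rows.map(_._1)
```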
I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.
The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.
So for example, I have executed (as part of my exploration) in spark-shell
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))
which if 'merged' will look like this matrix
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0, v1, v2 to rows?
SparkContext's parallelize method turns a local sequence into an RDD, which is exactly what you need here. Since you have already created the vectors, all that is left is to put them in a Seq and parallelize it, as shown below.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()