How to convert scala vector to spark ML vector? - scala

I have a vector of type scala.collection.immutable.Vector and would like to convert it to a vector of type org.apache.spark.ml.linalg.Vector.
For example, I want something like the following;
import org.apache.spark.ml.linalg.Vectors
val scalaVec = Vector(1,2,3)
val sparkVec = Vectors.dense(scalaVec)
Note that I could simply type val sparkVec = Vectors.dense(1,2,3) but I want to convert existing scala collection Vectors. I want to do this to embed these DenseVectors in a DataFrame to feed into spark.ml pipelines.

Vectors.dense can take an array of doubles. Likely what is causing your trouble is that Vectors.dense won't accept Ints which you are using in scalaVec in your example. So the following fails:
val test = Seq(1,2,3,4,5).to[scala.Vector].toArray
Vectors.dense(test)
import org.apache.spark.ml.linalg.Vectors
test: Array[Int] = Array(1, 2, 3, 4, 5)
<console>:67: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.ml.linalg.Vector cannot be applied to (Array[Int])
Vectors.dense(test)
While this works:
val testDouble = Seq(1,2,3,4,5).map(x=>x.toDouble).to[scala.Vector].toArray
Vectors.dense(testDouble)
testDouble: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
res11: org.apache.spark.ml.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]

You can pass vector element as var-args as follows :
val scalaVec = Vector(1, 2, 3)
val sparkVec = Vectors.dense(scalaVec:_*)

Related

length of each word in array by using scala

I have data like this below. In an array we have different words
scala> val x=rdd.flatMap(_.split(" "))
x: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at flatMap at <console>:26
scala> x.collect
res41: Array[String] = Array(Roses, are, red, Violets, are, blue)
I want find the length of each word in an array in scala
Spark allows you to chain the functions that are defined on a RDD[T], which is RDD[String] in your case. You can add the map function following your flatMap function to get the lengths.
val sentence: String = "Apache Spark is a cluster compute engine"
val sentenceRDD = sc.makeRDD(List(sentence))
val wordLength = sentenceRDD.flatMap(_.split(" ")).map(_.length)
wordLength.take(2)
For instance I'll use your value x to show the demonstration:
we can do something like this to find the length of each word in array in scala
>x.map(s => s -> s.length)
This will print out the following:
Array[(String, Int)] = Array((Roses,5), (are,3), (red,3), (Violets,7), (are,3), (blue,4))
In the case, if you are using Spark. Then change as follows:
>x.map(s => s -> s.length).collect()
This will print out the following:
Array[(String, Int)] = Array((Roses,5), (are,3), (red,3), (Violets,7), (are,3), (blue,4))
If you want only the length of each word then use this:
>x.map(_.length).collect()
Output:
Array(5,3,3,7,3,4)
you can just give ...
val a = Array("Roses", "are", "red", "Violets", "are", "blue")
var b = a.map(x => x.length)
This will give you Array[Int] = Array(5, 3, 3, 7, 3, 4)

convert a 2d list to RDD[vector] or JavaRDD[vector] scala

I have a 2d list of integers and I would like to convert it to either RDD[vector] or JavaRDD[vector] in order to use the predict method of the SVM model in spark MLlib.
I have tried the following, in order to convert it to rdd. But it seems that this is not what I need.
val tuppleSlides = encoded.iterator.sliding(10).toList
val rdd = sc.parallelize(tuppleSlides)
Any ideas what is the command to convert it to the right type?
Thank you in advance.
If you want to use MLlib you will need an RDD[LabeledPoint]. Given your 2D list of data and some list of labels, you can create your RDD[LabeledPoint] like so:
scala> val labels = List(1.0, -1.0)
labels: List[Double] = List(1.0, -1.0)
scala> val myData = List(List(1d,2d), List(3d,4d))
myData: List[List[Double]] = List(List(1.0, 2.0), List(3.0, 4.0))
scala> import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.feature.LabeledPoint
scala> val vectors = myData.map(x => Vectors.dense(x.toArray))
vectors: List[org.apache.spark.ml.linalg.Vector] = List([1.0,2.0], [3.0,4.0])
scala> val labPts = labels.zip(vectors).map{case (l, fV) => LabeledPoint(l, fV)}
labPts: List[org.apache.spark.ml.feature.LabeledPoint] = List((1.0,[1.0,2.0]), (-1.0,[3.0,4.0]))
scala> val myRDD = sc.parallelize(labPts)
myRDD: org.apache.spark.rdd.RDD[org.apache.spark.ml.feature.LabeledPoint] = ParallelCollectionRDD[0] at parallelize at <console>:34

Convert local Vectors to RDD[Vector]

I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.
The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.
So for example, I have executed (as part of my exploration) in spark-shell
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))
which if 'merged' will look like this matrix
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0, v1, v2 to rows?
By using the property of Spark Context which parallelize the Sequence, we can achieve the thing you want, Since you have created vectors,now all you required to bring them in sequence and parallelize by the process given below.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()

How to easily convert IndexedSeq[Array[Int]] to Seq[Seq[Int]] in Scala?

I have a function that takes a list of lists of integer, specifically Seq[Seq[Int]]. Then I produce this data from reading a text file and using split, and that produces a list of Array. That is not recognized by Scala, who raises a match error. But either IndexedSeq or Array alone are OK with a Seq[Int] function, apparently only the nested collection is an issue. How can I convert implicitly IndexedSeq[Array[Int]] to Seq[Seq[Int]], or how else could I do this other than using toList as demonstrated below? Iterable[Iterable[Int]] seems to be fine, for instance, but I can't use this.
scala> def g(x:Seq[Int]) = x.sum
g: (x: Seq[Int])Int
scala> g("1 2 3".split(" ").map(_.toInt))
res6: Int = 6
scala> def f(x:Seq[Seq[Int]]) = x.map(_.sum).sum
f: (x: Seq[Seq[Int]])Int
scala> f(List("1 2 3", "3 4 5").map(_.split(" ").map(_.toInt)))
<console>:9: error: type mismatch;
found : List[Array[Int]]
required: Seq[Seq[Int]]
f(List("1 2 3", "3 4 5").map(_.split(" ").map(_.toInt)))
^
scala> f(List("1 2 3", "3 4 5").map(_.split(" ").map(_.toInt).toList))
res8: Int = 18
The problem is that Array does not implement SeqLike. Normally, implicit conversions to ArrayOps or WrappedArray defined in scala.predef allow to use array just like Seq. However, in your case array is 'hidden' from implicit conversions as a generic argument. One solution would be to hint compiler that you can apply an implicit conversion to the generic argument like this:
def f[C <% Seq[Int]](x:Seq[C]) = x.map(_.sum).sum
This is similar to Paul's response above. The problem is that view bounds are deprecated in Scala 2.11 and using deprecated language features is not a good idea. Luckily, view bounds can be rewritten as context bounds as follows:
def f[C](x:Seq[C])(implicit conv: C => Seq[Int]) = x.map(_.sum).sum
Now, this assumes that there is an implicit conversion from C to Seq[Int], which is indeed present in predef.
How about this:
implicit def _convert(b:List[Array[Int]]):Seq[Seq[Int]]=b.map(_.toList)
Redefine f to be a bit more flexible.
Since Traversable is a parent of List, Seq, Array, etc., f will be compatible with these containers if it based on Traversable. Traversable has sum, flatten, and map, and that is all that's needed.
What is tricky about this is that
def f(y:Traversable[Traversable[Int]]):Int = y.flatten.sum
is finicky and doesn't work on a y of type List[Array[Int]] although it will work on Array[List[Int]]
To make it less finicky, some type view bounds will work.
Initially, I replaced your sum of sums with a flatten/sum operation.
def f[Y<%Traversable[K],K<%Traversable[Int]](y:Y):Int=y.flatten.sum
I found this also seems to work but I did not test as much:
def f[Y <% Traversable[K], K <% Traversable[Int]](y:Y):Int=y.map(_.sum).sum
This <% syntax says Y is viewable as Traversable[K] for some type K that is viewable as a Traversable of Int.
Define some different containers, including the one you need:
scala> val myListOfArray = List(Array(1,2,3),Array(3,4,5))
val myListOfArray = List(Array(1,2,3),Array(3,4,5))
myListOfArray: List[Array[Int]] = List(Array(1, 2, 3), Array(3, 4, 5))
scala> val myArrayOfList = Array(List(1,2,3),List(3,4,5))
val myArrayOfList = Array(List(1,2,3),List(3,4,5))
myArrayOfList: Array[List[Int]] = Array(List(1, 2, 3), List(3, 4, 5))
scala> val myListOfList = List(List(1,2,3),List(3,4,5))
val myListOfList = List(List(1,2,3),List(3,4,5))
myListOfList: List[List[Int]] = List(List(1, 2, 3), List(3, 4, 5))
scala> val myListOfRange = List(1 to 3, 3 to 5)
val myListOfRange = List(1 to 3, 3 to 5)
myListOfRange: List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3), Range(3, 4, 5))
Test:
scala> f(myListOfArray)
f(myListOfArray)
res24: Int = 18
scala> f(myArrayOfList)
f(myArrayOfList)
res25: Int = 18
scala> f(myListOfList)
f(myListOfList)
res26: Int = 18
scala> f(myListOfRange)
f(myListOfRange)
res28: Int = 18

Convert tuple to a List of first item

Say I have a method that returns this.
Vector[ (PkgLine, Tree) ]()
I want to convert this to a List of PkgLines. I want to drop the Tree off. I'm not seeing anything in the scala library that would allow me to do this. Anybody have any simple ideas? Thanks.
val list = vector.map(_._1).toList
If you have a Tupel t, you can access its first element using t._1. So with the map operation, you're effectively throwing away the trees, and store the PkgLines directly. Then you simply convert the Vector to List.
Using map with a selector of the first element of the pair works:
scala> val v = Vector[(Int,String)]((5,"5"), (42,"forty-two"))
v: ... = Vector((5,5), (42,forty-two))
scala> v.map(_._1).toList
resN: List[Int] = List(5, 42)
Alternatively, you can use unzip:
scala> val (ints,strings) = v.unzip
ints: scala.collection.immutable.Vector[Int] = Vector(5, 42)
strings: scala.collection.immutable.Vector[String] = Vector(5, forty-two)
scala> ints.toList
resN: List[Int] = List(5, 42)