I'm new to Spark and Scala, and I'm trying to read the MLlib documentation. The tutorial at http://spark.apache.org/docs/1.4.0/mllib-data-types.html gives this example:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
but it does not show how to construct an RDD[Vector] (the variable rows) from a list of local vectors.
So, for example, I have executed the following in spark-shell as part of my exploration:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
which, if 'merged' into a matrix, would look like this:
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0, v1, v2 to rows?
You can use the parallelize method of the Spark Context, which turns a local sequence into an RDD. Since you have already created the vectors, all that remains is to put them in a Seq and parallelize it, as shown below:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
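You can sanity-check the result from spark-shell by collecting the rows back to the driver. A quick sketch; the dense row prints as [values] and the sparse rows in MLlib's (size,[indices],[values]) format:
mat.rows.collect().foreach(println)
// expected output, roughly:
// [1.0,0.0,3.0]
// (3,[1],[2.5])
// (3,[0,1],[1.5,1.8])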
I have a vector of type scala.collection.immutable.Vector and would like to convert it to a vector of type org.apache.spark.ml.linalg.Vector.
For example, I want something like the following:
import org.apache.spark.ml.linalg.Vectors
val scalaVec = Vector(1,2,3)
val sparkVec = Vectors.dense(scalaVec)
Note that I could simply type val sparkVec = Vectors.dense(1, 2, 3), but I want to convert existing Scala collection Vectors. I want to do this so I can embed these DenseVectors in a DataFrame to feed into spark.ml pipelines.
Vectors.dense can take an array of Doubles. What is likely causing your trouble is that Vectors.dense won't accept the Ints you are using in scalaVec. So the following fails:
import org.apache.spark.ml.linalg.Vectors
val test = Seq(1,2,3,4,5).to[scala.Vector].toArray
test: Array[Int] = Array(1, 2, 3, 4, 5)
Vectors.dense(test)
<console>:67: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.ml.linalg.Vector
 cannot be applied to (Array[Int])
       Vectors.dense(test)
While this works:
val testDouble = Seq(1,2,3,4,5).map(x=>x.toDouble).to[scala.Vector].toArray
Vectors.dense(testDouble)
testDouble: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
res11: org.apache.spark.ml.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]
You can also pass the vector elements as varargs. Note that the varargs overload is dense(firstValue: Double, otherValues: Double*), so the elements must already be Doubles and the first one has to be passed separately:
val scalaVec = Vector(1.0, 2.0, 3.0)
val sparkVec = Vectors.dense(scalaVec.head, scalaVec.tail: _*)
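For the original Int-valued scalaVec, the two approaches above combine into a one-liner sketch: convert the elements to Double first, then hand Vectors.dense the resulting array:
val scalaVec = Vector(1, 2, 3)
// map to Double before converting, since dense only accepts Doubles
val sparkVec = Vectors.dense(scalaVec.map(_.toDouble).toArray)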
I have a 2D list of integers, and I would like to convert it to either RDD[Vector] or JavaRDD[Vector] in order to use the predict method of the SVM model in Spark MLlib. I have tried the following to convert it to an RDD, but it seems that this is not what I need:
val tuppleSlides = encoded.iterator.sliding(10).toList
val rdd = sc.parallelize(tuppleSlides)
Any ideas on how to convert it to the right type?
Thank you in advance.
If you want to use the SVM from MLlib's RDD-based API, you will need an RDD[LabeledPoint]; note also that SVMWithSGD expects 0/1 labels rather than -1/1. Given your 2D list of data and a list of labels, you can create your RDD[LabeledPoint] like so:
scala> val labels = List(1.0, 0.0)
labels: List[Double] = List(1.0, 0.0)
scala> val myData = List(List(1d,2d), List(3d,4d))
myData: List[List[Double]] = List(List(1.0, 2.0), List(3.0, 4.0))
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint
scala> val vectors = myData.map(x => Vectors.dense(x.toArray))
vectors: List[org.apache.spark.mllib.linalg.Vector] = List([1.0,2.0], [3.0,4.0])
scala> val labPts = labels.zip(vectors).map{case (l, fV) => LabeledPoint(l, fV)}
labPts: List[org.apache.spark.mllib.regression.LabeledPoint] = List((1.0,[1.0,2.0]), (0.0,[3.0,4.0]))
scala> val myRDD = sc.parallelize(labPts)
myRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = ParallelCollectionRDD[0] at parallelize at <console>:34
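From here, a minimal sketch of the original goal, training MLlib's SVM and calling predict on the feature vectors (this assumes the session above; the iteration count is an arbitrary choice):
import org.apache.spark.mllib.classification.SVMWithSGD
// train on the labeled points, then predict on the bare feature vectors
val model = SVMWithSGD.train(myRDD, 100)
val predictions = model.predict(myRDD.map(_.features))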
I'm trying to convert an edge list, which is in the following format:
data = [('a', 'developer'),
('b', 'tester'),
('b', 'developer'),
('c','developer'),
('c', 'architect')]
where the adjacency matrix will be of the form:
developer tester architect
a 1 0 0
b 1 1 0
c 1 0 1
I want to store the matrix in the following format:
1 0 0
1 1 0
1 0 1
I've tried it using GraphX:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def pageHash(title: String) = title.toLowerCase.replace(" ", "").hashCode.toLong
val edges: RDD[Edge[String]] = sc.textFile("/user/query.csv").map { line =>
  val row = line.split(",")
  Edge(pageHash(row(0)), pageHash(row(1)), "1")
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)
I'm able to create the graph, but I am not able to convert it to an adjacency matrix representation.
One possible way to approach this is as follows.
Convert the RDD to a DataFrame:
import sqlContext.implicits._
val rdd = sc.parallelize(Seq(
  ("a", "developer"), ("b", "tester"), ("b", "developer"),
  ("c", "developer"), ("c", "architect")))
val df = rdd.toDF("row", "col")
Index the columns:
import org.apache.spark.ml.feature.StringIndexer
val indexers = Seq("row", "col").map(x =>
new StringIndexer().setInputCol(x).setOutputCol(s"${x}_idx").fit(df)
)
Transform the data and create an RDD[MatrixEntry]:
import org.apache.spark.sql.functions.lit
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
val entries = indexers.foldLeft(df)((df, idx) => idx.transform(df))
  .select(
    $"row_idx".cast("long").alias("i"),
    $"col_idx".cast("long").alias("j"),
    lit(1.0).alias("value"))
  .as[MatrixEntry] // Spark 1.6+. For earlier versions, map manually
  .rdd
Create the matrix:
new CoordinateMatrix(entries)
This matrix can be further converted to any other type of distributed matrix including RowMatrix and IndexedRowMatrix.
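For example, a sketch of recovering the dense 0/1 rows from the matrix built above. Note that StringIndexer assigns indices by frequency, so the row and column order will not necessarily match the original a/b/c and developer/tester/architect order:
import org.apache.spark.mllib.linalg.Vectors
val mat = new CoordinateMatrix(entries)
mat.toIndexedRowMatrix().rows
  .sortBy(_.index)
  .map(row => Vectors.dense(row.vector.toArray)) // densify the sparse rows
  .collect()
  .foreach(println)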
I am trying to use the new ML library with Spark and DataFrames to build a recommender with implicit ratings.
My code
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
sc = SparkContext()
sqlContext = SQLContext(sc)
# create the dataframe (user x item)
df = sqlContext.createDataFrame(
[(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
["user", "item"])
als = ALS() \
.setRank(10) \
.setImplicitPrefs(True)
model = als.fit(df)
print "Rank %i " % model.rank
model.userFactors.orderBy("id").collect()
test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
for p in predictions: print p
However, I run into this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'rating' given input columns user, item;
So I am not sure how to define the DataFrame.
It appears you are trying to use (user, product) tuples, but you need (user, product, rating) triplets. Even for implicit ratings you do need the ratings; if they are all the same, you can use a constant like 1.0 in a third "rating" column.
I am confused, because the MLlib API has a separate API call for implicit ratings:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
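trainImplicit belongs to the older RDD-based API and takes an RDD[Rating], so the same rule applies there: each Rating still carries a numeric value, which for implicit feedback is treated as a confidence weight rather than an explicit rating. A minimal sketch with made-up data:
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// every observed (user, item) pair gets a constant weight of 1.0
val ratings = sc.parallelize(Seq(
  Rating(0, 0, 1.0), Rating(0, 1, 1.0), Rating(1, 1, 1.0),
  Rating(1, 2, 1.0), Rating(2, 1, 1.0), Rating(2, 2, 1.0)))
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 0.01)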
I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how I can get the right transformation in order to pass the values to the distributed matrix.
EDIT: Maybe I have been able to solve it:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this should work. It typechecks, but for some reason it won't run in my Scala worksheet:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0)

// make two rows of the column values, transpose the result,
// and turn each resulting row into a dense Vector
val t = List(col, col).transpose.map(r => Vectors.dense(r.toArray))

// make an RDD from the resulting sequence of Vectors, and
// make a RowMatrix from that
val rm = new RowMatrix(sc.makeRDD(t))
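If the context starts up correctly, the dimensions should come back as expected:
// quick sanity check; RowMatrix computes these from the RDD
rm.numRows() // 5
rm.numCols() // 2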