UPDATED:
I have the following representation:
scala> val B = sc.parallelize(Seq(Vectors.dense(5.0, 1.0, -2.0), Vectors.dense(10.0, 2.0, -9.0), Vectors.dense(12.0, 8.0, 2.0)))
B: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[222] at parallelize at <console>:48
I represented it as a RowMatrix by doing the following:
scala> val rowB = new RowMatrix(B)
rowB: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix#4c1838e
I would like to know how to represent this as a DenseMatrix in Breeze.
Is that possible?
I'm a novice Spark coder, so I am looking for material that can help me understand whether this is possible, since I would like to do a matrix multiplication of a RowMatrix and a DenseMatrix.
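For what it's worth, here is a minimal sketch of one way to do this, assuming the matrix is small enough to collect to the driver (toBreeze is just a hypothetical helper name). Note also that RowMatrix itself has a multiply(B: Matrix) method that takes a local mllib Matrix, which may already cover the multiplication use case without any conversion:
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Collect the rows to the driver and copy them into a Breeze DenseMatrix.
// This only works when the whole matrix fits in driver memory.
def toBreeze(rm: RowMatrix): BDM[Double] = {
  val mat = BDM.zeros[Double](rm.numRows().toInt, rm.numCols().toInt)
  rm.rows.collect().zipWithIndex.foreach { case (vec, i) =>
    vec.toArray.zipWithIndex.foreach { case (v, j) => mat(i, j) = v }
  }
  mat
}

val denseB = toBreeze(rowB)   // breeze.linalg.DenseMatrix[Double]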
Related
I am using spark with scala and trying to do the following.
I have two dense vectors (created using Vectors.dense), and I need to find the dot product of these. How could I accomplish this?
Also, I am creating the vectors from an input file which is comma separated. However, some values are missing. Is there an easy way to read these values as zero instead of null when I am creating the vectors?
For example:
input file: 3,1,,,2
created vector: 3,1,0,0,2
Spark vectors are just wrappers around arrays; internally they get converted to Breeze vectors for vector/matrix operations. You can do just that manually to get the dot product:
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
val dv1: Vector = Vectors.dense(1.0, 0.0, 3.0)
val bdv1 = new BDV(dv1.toArray)
val dv2: Vector = Vectors.dense(2.0, 0.0, 0.0)
val bdv2 = new BDV(dv2.toArray)
scala> bdv1 dot bdv2
res3: Double = 2.0
For your second question, you can do something like this:
val v: String = "3,1,,,2"
scala> v.split("\\,").map(r => if (r == "") 0 else r.toInt)
res4: Array[Int] = Array(3, 1, 0, 0, 2)
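Building on that, a small sketch that combines the two steps and builds a dense Vector directly (parseLine is just a hypothetical helper name):
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// split with limit -1 keeps trailing empty fields; empty fields become 0.0
def parseLine(line: String): Vector =
  Vectors.dense(line.split(",", -1).map(s => if (s.trim.isEmpty) 0.0 else s.trim.toDouble))

parseLine("3,1,,,2")  // [3.0,1.0,0.0,0.0,2.0]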
I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame whose structure is described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.1, Vectors.dense(1.1, 0.1)),
  (0.2, Vectors.dense(1.0, -1.0)),
  (3.0, Vectors.dense(1.3, 1.0)),
  (1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in array which I had pulled out of a DF:
val my_a = gspc17_df.collect().map { row => Seq(row(2), Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double])) }
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How can I copy data from my array into a DataFrame with the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>
The first problem here is that you use List to store the row data. List is a homogeneous data structure, and since the only common type of row(2) (an Any) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-ish perspective it is not a good approach either. collect-ing just to transform the data shouldn't require any comment, and mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch, you can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array(df.columns(3), df.columns(4)))
  .setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.sql.functions.{col, udf}

val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
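Either way, you can finish by renaming the columns to the label/features layout from the question. A sketch, reusing df and the assembler from the snippets above (df stands for the original gspc17_df):
import org.apache.spark.sql.functions.col

// Pick the label column by position and keep the assembled features column.
val training = assembler
  .transform(df)
  .select(col(df.columns(2)).as("label"), col("features"))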
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.
I'm trying to perform an aggregation using mapGroups that returns a SparseMatrix as one of the columns, and sum the columns.
I created a case class schema for the mapped rows in order to provide column names. The matrix column is typed as org.apache.spark.mllib.linalg.Matrix. If I don't run toDF before performing the aggregation (select(sum("mycolumn"))), I get one type mismatch error (required: org.apache.spark.sql.TypedColumn[MySchema,?]). If I include toDF, I get another type mismatch error: cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT. So what's the right way to do it?
It looks like you are struggling with at least two distinct problems here. Let's assume you have a Dataset like this:
import org.apache.spark.mllib.linalg.Matrices

val ds = Seq(
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))),
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS
Selecting TypedColumn:
using implicit conversions with $:
ds.select($"_1".as[String])
using o.a.s.sql.functions.col:
ds.select(col("_1").as[String])
Adding matrices:
MLlib Matrix and MatrixUDT don't implement addition. This means you won't be able to use the sum function or reduce with +.
You can use a third-party linear algebra library, but this is not supported in Spark SQL / Spark Dataset.
If you really want to do it with Datasets, you can try to do something like this:
ds.groupByKey(_._1).mapGroups(
  (key, values) => {
    val matrices = values.map(_._2.toArray)
    val first = matrices.next
    val sum = matrices.foldLeft(first)(
      (acc, m) => acc.zip(m).map { case (x, y) => x + y }
    )
    (key, sum)
  })
and map the result back to matrices, but personally I would just convert to an RDD and use Breeze.
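A sketch of that RDD + Breeze route, assuming every matrix under a key has the same dimensions:
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrices

// mllib's Matrix.toArray is column-major, which matches Breeze's default layout,
// so each Matrix can be wrapped in a Breeze DenseMatrix and summed per key.
val summed = ds.rdd
  .map { case (key, m) => (key, new BDM(m.numRows, m.numCols, m.toArray)) }
  .reduceByKey(_ + _)

// If you need mllib matrices again, rebuild them from the Breeze data:
val asMatrices = summed.mapValues(m => Matrices.dense(m.rows, m.cols, m.data))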
I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to the Spark docs:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. I try:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In the real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
Which does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How can I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How to convert doubleCnts into observations: RDD[Vector] ?
1) You don't need to cast; you just need to ascribe the type:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1, and reduce by summing) rather than countByKey; a sketch follows below.
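A minimal sketch of that idea (the pairs RDD is a hypothetical stand-in for myRdd from the question, with String keys assumed; reduceByKey is used here as the simplest form of that aggregation):
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the question's myRdd.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 7), ("a", 3)))

// Distributed count per key: map every key to 1, then sum per key.
// Unlike countByKey, the result stays on the cluster as an RDD.
val cntsRdd: RDD[(String, Long)] = pairs.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)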
DenseVector has a compressed method, so you can convert an RDD[DenseVector] to an RDD[Vector] like this:
val st = observations.map(x => x.compressed)
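To address the Update directly, a minimal sketch (reusing the doubleCnts name from the question): wrap each Double in a one-element Vector, and the resulting RDD[Vector] can be fed to colStats.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

// Each count becomes a 1-element row, so colStats summarizes a single column.
val observations: RDD[Vector] = doubleCnts.map(d => Vectors.dense(d))
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
// summary.mean, summary.variance, summary.max, ... are now available.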
Using the Scala Breeze library:
How can I convert an instance of a breeze.linalg.DenseMatrix of Double values to a DenseMatrix of Ints (both matrices have the same dimensions)?
(I am trying to get a image/picture in a matrix for image processing using Breeze)
fotNelton's answer works. Another option is:
dm.mapValues(_.toInt)
or
dm.values.map(_.toInt)
As of Breeze 0.6, you can also say:
convert(dm, Int)
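A self-contained sketch of both options (convert assumes Breeze 0.6 or later):
import breeze.linalg._

val dm = DenseMatrix((1.5, 2.5), (3.5, 4.5))  // DenseMatrix[Double]
val viaMap = dm.mapValues(_.toInt)            // element-wise conversion
val viaConvert = convert(dm, Int)             // DenseMatrix[Int]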
You can use DenseMatrix.tabulate for this:
scala> val dm = DenseMatrix((1.0, 2.0), (3.0, 4.0))
dm: breeze.linalg.DenseMatrix[Double] =
1.0 2.0
3.0 4.0
scala> val im = DenseMatrix.tabulate(dm.rows, dm.cols)(dm(_,_).toInt)
im: breeze.linalg.DenseMatrix[Int] =
1 2
3 4