I'm implementing a machine learning algorithm in Apache Spark MLlib and I would like to multiply a vector by a scalar, i.e. compute u_i_j_m * x_i, where u_i_j_m is a Double and x_i is a vector.
I've tried the following:
import breeze.linalg.{DenseVector => BDV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.rdd.RDD
...
private def runAlgorithm(data: RDD[VectorWithNorm]) = {
  ...
  data.mapPartitions { data_points =>
    val c = Array.fill(clustersNum)(BDV.zeros[Double](dim).asInstanceOf[BV[Double]])
    ...
    data_points.foreach { data_point =>
      ...
      val u_i_j_m: Double = ...
      val temp = (data_point.vector * u_i_j_m)
      // c(j) = temp
    }
  }
}
Where VectorWithNorm is defined as follows:
class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
  def this(array: Array[Double]) = this(Vectors.dense(array))
  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
}
But when I build the project I get the following error:
Error: value * is not a member of org.apache.spark.mllib.linalg.Vector
val temp = (data_point.vector * u_i_j_m)
How can I do this multiplication correctly?
Unfortunately, the Spark contributors decided not to pick a library for the underlying computations (i.e. linear algebra) in Scala. Under the hood they use Breeze, but the scalar * and + on Spark Vectors are private, as are other useful methods. This is quite different from Python, where you can use the excellent NumPy linear algebra library. The arguments were that the developers were stretched thin, that Breeze looked suspicious because its development had stalled (if I remember correctly), and that there was an alternative (apache.commons.math), so they decided to let users pick whichever linalg library they want in Scala. But, prompted by some members of the community, there is now a Spark package which provides linear algebra on org.apache.spark.mllib.linalg.Vector - see here.
In your code you are using Spark's Vector trait instead of Breeze's DenseVector, which is why there is no * operator defined on your data_point.vector member.
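One common workaround (a sketch, assuming Breeze is on the classpath as a transitive dependency of MLlib; the helper names toBreeze and fromBreeze are my own, not Spark API) is to convert to a Breeze vector, scale it, and convert back:
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helpers: Spark's own conversions are private to mllib.
def toBreeze(v: Vector): BDV[Double] = new BDV(v.toArray)
def fromBreeze(bv: BDV[Double]): Vector = Vectors.dense(bv.toArray)

val v = Vectors.dense(1.0, 2.0, 3.0)
val scaled: Vector = fromBreeze(toBreeze(v) * 2.0) // [2.0, 4.0, 6.0]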
Problem: Let val v = List(0.5, 1.2, 0.3) model a realisation of some vector v = (v_1, v_2, v_3). The index j in v_j is implied by the element's position in the list. A lot of boilerplate and bugs have come from tracking these indices, e.g. when creating modified lists from the original. (The question "How to make sure (in compile-time) that collection wasn't reordered?" seems related.)
General question: What could be a good way to ensure correct indexing at compile time? (I assume that performance is not essential.)
My plan is to use subclasses of Nat in shapeless to model the indices as (explicit) types. The current solution is
import shapeless._
import Nat._

trait Elem[+A, +N <: Nat] {
  val v: A
  val ind: N
}

case class DecElem[N <: Nat](v: BigDecimal, ind: N) extends Elem[BigDecimal, N]

object Decimals {
  type One = DecElem[_0] :: HNil
  type Two = DecElem[_0] :: DecElem[_1] :: HNil
  //...
}
case class Scalar(v: Decimals.One)
case class VecTwo(v: Decimals.Two)
This, however, gets tedious in larger dimensions.
Another question is how to approach the generic case in trait Elem[+A, +N <: Nat]. As a start, I defined case class ElemVector[A, M <: Nat](vs: Sized[List[Elem[A, Nat]], M]), which loses the specific index type of each Elem. What might be a strategy to circumvent this difficulty?
(Note: a better design may be to wrap List[A] by attaching explicit indices and deal with the wrapper class. This, however, does not essentially change the question.)
UPDATE: Here is a trivial illustration of accidentally swapping vector elements.
import shapeless.Nat._
import shapeless.{Sized, nat}
type VectorTwo = Sized[IndexedSeq[Double], nat._2]
val f = (p: VectorTwo, x: Double) => {
  val V = p(_0)
  val K = p(_1)
  V * x / (K + x)
}
val V : Double = 500
val K : Double = 1
val correct: VectorTwo = Sized(V, K)
val wrong: VectorTwo = Sized(K, V) //Compiles!
f(correct, 10) // = 454.54
f(wrong, 10) // = 0.02
I'd like to enlist the compiler to prevent such errors in vectors with many elements, and wonder if there could be an elegant solution with Shapeless.
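For comparison, here is a minimal non-Shapeless sketch (my own illustration; the names Vmax and Km are hypothetical): giving each position its own wrapper type turns the swapped construction into a compile error, at the cost of one class per component:
// Distinct wrapper types per component: swapping them no longer compiles.
case class Vmax(value: Double)
case class Km(value: Double)
case class Params(vmax: Vmax, km: Km)

val g = (p: Params, x: Double) => p.vmax.value * x / (p.km.value + x)

val ok = Params(Vmax(500), Km(1))
// Params(Km(1), Vmax(500))  // does not compile: type mismatch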
It is pretty easy to normalize vectors in Scala (scala.collection.immutable.Vector) using map:
val w = Vector(3.0, 4.0, 5.0)
/** L1 norm: **/
val w_normalized = w.map { _ / w.sum }
But you can't do the same thing with org.apache.spark.mllib.linalg.Vector: if you try it, you get an error:
error: value map is not a member of org.apache.spark.mllib.linalg.Vector
So, what is a way to normalize Spark vectors?
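A minimal sketch of one workaround (my own, not from the original post): since mllib's Vector exposes no map, fall back to the underlying array and rebuild the vector:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// L1-normalize by round-tripping through the underlying array.
def normalizeL1(v: Vector): Vector = {
  val arr = v.toArray
  val sum = arr.sum
  Vectors.dense(arr.map(_ / sum))
}

normalizeL1(Vectors.dense(3.0, 4.0, 5.0)) // [0.25, 0.333..., 0.4166...]
MLlib also ships org.apache.spark.mllib.feature.Normalizer, which covers the common p-norm cases.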
I thought I needed to parameterise my function across all Ordering[_] types, but that doesn't work.
How can I make the following function work for all types that support the required mathematical operations, and how could I have found that out by myself?
/**
 * Given a list of positive values and a candidate value, round the candidate value
 * to the nearest value in the list of buckets.
 *
 * @param buckets
 * @param candidate
 * @return
 */
def bucketise(buckets: Seq[Int], candidate: Int): Int = {
  // invariant: x <= y, since buckets is sorted ascending
  buckets.foldLeft(buckets.head) { (x, y) =>
    val midPoint = (x + y) / 2f
    if (candidate < midPoint) x else y
  }
}
I tried Cmd-clicking the mathematical operators (/, +) in IntelliJ, but just got a notice Sc synthetic function.
If you want to use just the Scala standard library, look at Numeric[T]. In your case, since you want to do a non-integer division, you would have to use the Fractional[T] subclass of Numeric.
Here is how the code would look using Scala standard library typeclasses. Note that Fractional extends Ordering (via Numeric). This is convenient in this case, but it is also not mathematically general: e.g. you can't define a Fractional[T] for Complex because the complex numbers are not ordered.
def bucketiseScala[T: Fractional](buckets: Seq[T], candidate: T): T = {
  // so we can use fractional operators such as + and /
  import Fractional.Implicits._
  // so we can use ordering operators such as <. We do have an Ordering[T]
  // typeclass instance because Fractional extends Ordering
  import Ordering.Implicits._
  // Fractional does not provide a simple way to create a value from an
  // Int, so this ugly hack
  val two = implicitly[Fractional[T]].one + implicitly[Fractional[T]].one
  buckets.foldLeft(buckets.head) { (x, y) =>
    val midPoint = (x + y) / two
    if (candidate < midPoint) x else y
  }
}
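For example, with Double, which has a Fractional instance in the standard library:
bucketiseScala(Seq(0.0, 10.0, 20.0), 12.0) // = 10.0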
However, for serious generic numerical computations I would suggest taking a look at spire. It provides a much more elaborate hierarchy of numerical typeclasses. Spire typeclasses are also specialized and therefore often as fast as working directly with primitives.
Here is how the example would look using spire:
// imports all operator syntax as well as standard typeclass instances
import spire.implicits._
// we need to provide Order explicitly, since not all fields have an order.
// E.g. you can define a Field[Complex] even though complex numbers do not
// have an order.
def bucketiseSpire[T: Field: Order](buckets: Seq[T], candidate: T): T = {
  // spire provides a way to get the typeclass instance using the type
  // (standard practice in all libraries that use typeclasses extensively);
  // the line below is equivalent to implicitly[Field[T]].fromInt(2).
  // It also provides a simple way to convert from an integer, and the
  // operators are all enabled by the spire.implicits._ import.
  val two = Field[T].fromInt(2)
  buckets.foldLeft(buckets.head) { (x, y) =>
    val midPoint = (x + y) / two
    if (candidate < midPoint) x else y
  }
}
Spire even provides automatic conversion from integers to T if there exists a Field[T], so you could even write the example like this (almost identical to the non-generic version). However, I think the example above is easier to understand.
// this is how it would look when using all advanced features of spire
def bucketiseSpireShort[T: Field: Order](buckets: Seq[T], candidate: T): T = {
  buckets.foldLeft(buckets.head) { (x, y) =>
    val midPoint = (x + y) / 2
    if (candidate < midPoint) x else y
  }
}
Update: spire is very powerful and generic, but it can also be somewhat confusing to a beginner, especially when things don't work. Here is an excellent blog post explaining the basic approach and some of the issues.
This is a follow-up to my previous question.
Travis Brown pointed out that java.util.Random is side-effecting and suggested the Rng library, a random monad, to make the code purely functional. Now I am trying to build a simplified random monad myself to understand how it works.
Does it make sense? How would you fix/improve the explanation below?
Random Generator
First, we plagiarize a random-generating function from java.util.Random:
// do some bit magic to generate a new random "seed" from the given "seed"
// and return both the new "seed" and a random value based on it
def next(seed: Long, bits: Int): (Long, Int) = ...
Note that next returns both the new seed and the value rather than just the value: we need the new seed to pass to the next invocation.
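For concreteness, here is one possible implementation, a sketch following the linear congruential generator documented for java.util.Random (the constants come from its Javadoc):
def next(seed: Long, bits: Int): (Long, Int) = {
  // LCG step: newSeed = (seed * 0x5DEECE66D + 0xB) mod 2^48
  val newSeed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
  // use the requested number of high-order bits as the random value
  (newSeed, (newSeed >>> (48 - bits)).toInt)
}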
Random Point
Now let's write a function to generate a random point in a unit square.
Suppose we have a function to generate a random double in range [0, 1]
def randomDouble(seed: Long): (Long, Double) = ... // some bit magic
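One possible implementation is a sketch mirroring java.util.Random.nextDouble, built from next above (strictly speaking it yields values in [0, 1)):
def randomDouble(seed: Long): (Long, Double) = {
  val (s1, hi) = next(seed, 26)
  val (s2, lo) = next(s1, 27)
  // combine 26 + 27 = 53 random bits into a double in [0, 1)
  (s2, ((hi.toLong << 27) + lo) / (1L << 53).toDouble)
}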
Now we can write a function to generate a random point.
def randomPoint(seed: Long): (Long, (Double, Double)) = {
  val (seed1, x) = randomDouble(seed)
  val (seed2, y) = randomDouble(seed1)
  (seed2, (x, y))
}
So far, so good: both randomDouble and randomPoint are pure. The only problem is that we composed randomDouble to build randomPoint in an ad hoc way; we don't have a generic tool for composing functions that yield random values.
Monad Random
Now we will define a generic tool to compose functions yielding random values. First, we generalize the type of randomDouble:
type Random[A] = Long => (Long, A) // generate a random value of type A
and then build a wrapper class around it (the class replaces the type alias, since both cannot share the name Random):
class Random[A](run: Long => (Long, A))
We need the wrapper to define the methods flatMap (bind in Haskell) and map, which are used by for-comprehensions.
class Random[A](run: Long => (Long, A)) {
  def apply(seed: Long): (Long, A) = run(seed)

  def flatMap[B](f: A => Random[B]): Random[B] =
    new Random({ seed: Long => val (seed1, a) = run(seed); f(a)(seed1) })

  def map[B](f: A => B): Random[B] =
    new Random({ seed: Long => val (seed1, a) = run(seed); (seed1, f(a)) })
}
Now we add a factory function to create a trivial Random[A] (which is absolutely deterministic rather than "random", by the way). This plays the role of return in Haskell:
def certain[A](a: A) = new Random({ seed: Long => (seed, a) })
Random[A] is a computation yielding a random value of type A. The methods flatMap and map and the function certain serve to compose simple computations into more complex ones. For example, we will compose two Random[Double]s to build a Random[(Double, Double)].
Monadic Random Point
Now that we have a monad, we are ready to revisit randomPoint and randomDouble. This time we define them as functions yielding Random[Double] and Random[(Double, Double)]:
def randomDouble(): Random[Double] = new Random({ seed: Long => ... })

def randomPoint(): Random[(Double, Double)] =
  randomDouble().flatMap(x => randomDouble().flatMap(y => certain((x, y))))
This implementation is better than the previous one since it uses generic tools (flatMap and certain) to compose two calls to Random[Double] into a Random[(Double, Double)]. Now we can reuse these tools to build more functions that generate random values.
Monte-Carlo calculation of Pi
Now we can use map to test if a random point is in the circle:
def randomCircleTest(): Random[Boolean] =
  randomPoint().map { case (x, y) => x * x + y * y <= 1 }
We can also define a Monte-Carlo simulation in terms of Random[A]
def monteCarlo(test: Random[Boolean], trials: Int): Random[Double] = ...
and finally the function to calculate PI
def pi(trials: Int): Random[Double] = ....
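Here is a sketch of how these two could look (one possibility, not spelled out above; since the points fall in the unit square, the fraction landing inside the circle quadrant approximates pi/4):
def monteCarlo(test: Random[Boolean], trials: Int): Random[Double] =
  new Random({ seed: Long =>
    // thread the seed through `trials` runs of the test, counting successes
    val (finalSeed, hits) = (1 to trials).foldLeft((seed, 0)) {
      case ((s, count), _) =>
        val (s2, inside) = test(s)
        (s2, if (inside) count + 1 else count)
    }
    (finalSeed, hits.toDouble / trials)
  })

def pi(trials: Int): Random[Double] =
  monteCarlo(randomCircleTest(), trials).map(fraction => 4 * fraction)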
All those functions are pure; an actual value is produced only when we finally run the Random[Double] returned by pi with a seed to get the value of pi.
Your approach is quite nice, although it is a little bit complicated. I also suggest you take a look at Chapter 6 of Functional Programming in Scala by Paul Chiusano and Runar Bjarnason.
The chapter is called "Purely Functional State"; it shows how to create a purely functional random generator and how to define an algebra on top of that data type to get full support for function composition.
Is it possible to enforce the size of a Vector passed to a method at compile time? I want to model an n-dimensional Euclidean space using a collection of points in the space that looks something like this (this is what I have now):
case class EuclideanPoint(coordinates: Vector[Double]) {
  def distanceTo(destination: EuclideanPoint): Double = ???
}
If I have a point created via EuclideanPoint(Vector(1, 0, 0)), it is a 3D Euclidean point. Given that, I want to make sure that the destination point passed to distanceTo has the same dimension.
I know I can do this by using Tuple1 through Tuple22, but I want to represent many different geometric spaces, and I would be writing 22 classes for each space if I did it with tuples - is there a better way?
It is possible to do this in a number of ways that all look more or less like what Randall Schulz has described in a comment. The Shapeless library provides a particularly convenient implementation, which lets you get something pretty close to what you want like this:
import shapeless._
case class EuclideanPoint[N <: Nat](
  coordinates: Sized[IndexedSeq[Double], N] { type A = Double }
) {
  def distanceTo(destination: EuclideanPoint[N]): Double =
    math.sqrt(
      (this.coordinates zip destination.coordinates).map {
        case (a, b) => (a - b) * (a - b)
      }.sum
    )
}
Now you can write the following:
val orig2d = EuclideanPoint(Sized(0.0, 0.0))
val unit2d = EuclideanPoint(Sized(1.0, 1.0))
val orig3d = EuclideanPoint(Sized(0.0, 0.0, 0.0))
val unit3d = EuclideanPoint(Sized(1.0, 1.0, 1.0))
And:
scala> orig2d distanceTo unit2d
res0: Double = 1.4142135623730951
scala> orig3d distanceTo unit3d
res1: Double = 1.7320508075688772
But not:
scala> orig2d distanceTo unit3d
<console>:15: error: type mismatch;
found : EuclideanPoint[shapeless.Nat._3]
required: EuclideanPoint[shapeless.Nat._2]
orig2d distanceTo unit3d
^
Sized comes with a number of nice features, including a handful of collections operations that carry along static guarantees about length. We can write the following for example:
val somewhere = EuclideanPoint(Sized(0.0) ++ Sized(1.0, 0.0))
And have an ordinary old point in three-dimensional space.
You could do this yourself with a type-level encoding of the natural numbers, as described at http://apocalisp.wordpress.com/2010/06/08/type-level-programming-in-scala/, and then parametrize your Vector by a natural number. This would not require the extra dependency, but would probably be more complicated than using Shapeless.
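A minimal sketch of that hand-rolled encoding (my own illustration; note that the dimension is a phantom type, so nothing checks the Vector's runtime length against N at construction):
sealed trait TNat
sealed trait TZero extends TNat
sealed trait TSucc[N <: TNat] extends TNat

// The dimension N is a phantom type parameter.
case class Point[N <: TNat](coordinates: Vector[Double]) {
  def distanceTo(destination: Point[N]): Double =
    math.sqrt(
      coordinates.zip(destination.coordinates)
        .map { case (a, b) => (a - b) * (a - b) }
        .sum
    )
}

type Two = TSucc[TSucc[TZero]]
type Three = TSucc[Two]

val a = Point[Two](Vector(0.0, 0.0))
val b = Point[Two](Vector(1.0, 1.0))
a distanceTo b // = 1.4142135623730951
// a distanceTo Point[Three](Vector(0.0, 0.0, 0.0)) // does not compile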