So I understand that Spark can perform iterative algorithms on a single RDD, for example logistic regression:
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
The above example is iterative because it maintains a global state w that is updated after each iteration, and whose updated value is used in the next iteration. Is this functionality possible in Spark Streaming? Consider the same example, except that points is now a DStream. In this case, you could create a new DStream that calculates the gradient with
val gradient = points.map(p =>
  (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
But how would you handle the global state w? It seems like w would have to be a DStream too (using updateStateByKey, maybe), but then its latest value would somehow need to be passed into the points map function, which I don't think is possible. I don't think DStreams can communicate in this way. Am I correct, or is it possible to have iterative computations like this in Spark Streaming?
I just found out that this is quite straightforward with the foreachRDD function. MLlib actually provides models that you can train with DStreams, and I found the answer in the StreamingLinearAlgorithm code. It looks like you can just keep your global update variable locally in the driver and update it inside foreachRDD, so there is actually no need to transform it into a DStream itself. You can apply this to the example I provided with something like
points.foreachRDD { (rdd, time) =>
  val gradient = rdd.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
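For comparison, this driver-side pattern is what MLlib's StreamingLinearAlgorithm implementations use internally. A minimal sketch with StreamingLinearRegressionWithSGD (D and trainingStream, a DStream[LabeledPoint], are assumed to already exist):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// The model holds its weight vector in the driver and refines it on
// every incoming batch via foreachRDD, in the same way as the snippet above.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(D))

model.trainOn(trainingStream)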
Hmm... you can achieve something by folding over your iteration range to update your gradient.
Also... I think you should keep Spark Streaming out of it, as this problem does not appear to have any feature that ties it to any kind of streaming requirement.
// So, assuming... points is somehow an RDD[Point]
val points = sc.textFile(...).map(parsePoint).cache()

// since foldLeft is ( B )( ( B, A ) => B ) => B
// Note: the fold has to stay local in the driver. RDD operations such as
// points.map below cannot be nested inside closures that run on executors,
// so folding over a parallelized range would fail at runtime.
val w = (1 to ITERATIONS).foldLeft(Vector.random(D)) { (acc, _) =>
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (acc dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  acc - gradient
}
I'm working on a large graph in GraphX, and I want to calculate the global clustering coefficient. I'm using a function from the book Spark GraphX in Action, which is:
def clusteringCoefficient[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]) = {
  val numTriplets = g.aggregateMessages[Set[VertexId]](
      et => { et.sendToSrc(Set(et.dstId))
              et.sendToDst(Set(et.srcId)) },
      (a, b) => a ++ b) // #A
    .map(x => { val s = (x._2 - x._1).size; s * (s - 1) / 2 })
    .reduce((a, b) => a + b)
  println(numTriplets)
  if (numTriplets == 0) 0.0
  else g.triangleCount.vertices.map(_._2).reduce(_ + _) / numTriplets.toFloat
}
I canonicalise the graph and partition it before running the algorithm, but for some graphs I get a negative clustering coefficient, which is impossible. I put the print statement in the function just for debugging, and for those graphs I get a negative number for numTriplets.
I'm not very experienced with Scala, so I can't tell if there is a bug in the implementation.
Any help would be appreciated!
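One thing worth checking (an assumption on my part, not something confirmed in the thread): both s * (s - 1) / 2 and the final reduce run in Int arithmetic, which silently wraps around on high-degree graphs and produces exactly this kind of negative count. A minimal demonstration of the failure mode, with Long as the fix:
val s = 50000 // neighbour-set size of a high-degree vertex
val wrapped = s * (s - 1) / 2        // Int arithmetic overflows: -897508648
val correct = s.toLong * (s - 1) / 2 // Long arithmetic: 1249975000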
Given the following:
val rdd = sc.parallelize(List(1, 2, 3))
I assumed that rdd.reduce((x,y) => (x - y)) would return -4 (i.e. (1-2)-3=-4), but it returned 2.
Why?
From the RDD source code (and docs):
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
reduce is a monoidal reduction, thus it assumes the function is commutative and associative, meaning that the order of applying it to the elements is not guaranteed.
Obviously, your function (x, y) => (x - y) is neither commutative nor associative.
In your case, the reduce might have been applied this way:
3 - (2 - 1) = 2
or
1 - (2 - 3) = 2
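You can see the partition dependence directly (a sketch; the multi-partition value depends on scheduling and may vary between runs):
// One partition: elements are reduced sequentially, left to right.
sc.parallelize(List(1, 2, 3), 1).reduce(_ - _) // (1 - 2) - 3 = -4
// Three partitions: per-partition results are merged in task
// completion order, so the result is not deterministic.
sc.parallelize(List(1, 2, 3), 3).reduce(_ - _) // e.g. 2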
You can easily replace the subtraction v1 - v2 - ... - vN with v1 - (v2 + ... + vN), so your code can look like
val v1 = 1
val values = Seq(2, 3)
val sum = sc.parallelize(values).reduce(_ + _)
val result = v1 - sum
As noted by @TzachZohar, the function must satisfy those two properties for the parallel computation to be sound. By collecting the RDD first, the reduce becomes a plain sequential (non-parallel) Scala reduction, which does not require them:
val rdd = sc.parallelize(1 to 3)
rdd.collect.reduce((x, y) => (x - y))
// Int = -4
I have the following method that computes the probability of a value in a DataSet:
/**
 * Compute the probabilities of each value on the given [[DataSet]]
 *
 * @param x single-column [[DataSet]]
 * @return sequence of probabilities for each value
 */
private[this] def probs(x: DataSet[Double]): Seq[Double] = {
  val counts = x.groupBy(_.doubleValue)
    .reduceGroup(_.size.toDouble)
    .name("X Probs")
    .collect
  val total = counts.sum
  counts.map(_ / total)
}
The problem is that when I submit my Flink job that uses this method, it causes Flink to kill the job due to a task timeout. I am executing this method for each attribute on a DataSet with only 40,000 instances and 9 attributes.
Is there a way to make this code more efficient?
After a few tries, I made it work with mapPartition. This method is part of a class InformationTheory, which does some computations to calculate entropy, mutual information, etc. So, for example, symmetrical uncertainty is computed like this:
/**
 * Computes 'symmetrical uncertainty' (SU) - a symmetric mutual information measure.
 *
 * It is defined as SU(X, Y) = 2 * (IG(X|Y) / (H(X) + H(Y)))
 *
 * @param xy [[DataSet]] with two features
 * @return SU value
 */
// mapPartitionWith comes from org.apache.flink.api.scala.extensions._
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.mapPartitionWith {
    case in ⇒
      val x = in map (_._2)
      val y = in map (_._1)
      val mu = mutualInformation(x, y)
      val Hx = entropy(x)
      val Hy = entropy(y)
      Some(2 * mu / (Hx + Hy))
  }
  su.collect.head.head
}
With this, I can compute entropy, mutual information, etc. efficiently. The catch is that it only works with a parallelism of 1; the problem resides in mapPartition, where each parallel instance sees only the elements of its own partition, so at a higher parallelism every partial result would cover just a slice of the data.
Is there a way I could do something similar to what I am doing here with symmetricalUncertainty, but with whatever level of parallelism?
I finally did it. I don't know if it's the best solution, but it's working with n levels of parallelism:
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.reduceGroup { in ⇒
    val invec = in.toVector
    val x = invec map (_._2)
    val y = invec map (_._1)
    val mu = mutualInformation(x, y)
    val Hx = entropy(x)
    val Hy = entropy(y)
    2 * mu / (Hx + Hy)
  }
  su.collect.head
}
You can check the entire code at InformationTheory.scala, and its tests at InformationTheorySpec.scala.
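For completeness, a hypothetical usage sketch (env, samples, and the surrounding setup are my assumptions, not from the thread):
import org.apache.flink.api.scala._

// Build a two-column DataSet and compute its symmetrical uncertainty.
val env = ExecutionEnvironment.getExecutionEnvironment
val samples = Seq((1.0, 0.0), (2.0, 1.0), (1.0, 1.0))
val xy: DataSet[(Double, Double)] = env.fromCollection(samples)
val su = symmetricalUncertainty(xy)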
I'm trying to understand how this FFT algorithm works: http://rosettacode.org/wiki/Fast_Fourier_transform#Scala
def _fft(cSeq: Seq[Complex], direction: Complex, scalar: Int): Seq[Complex] = {
  if (cSeq.length == 1) {
    return cSeq
  }
  val n = cSeq.length
  assume(n % 2 == 0, "The Cooley-Tukey FFT algorithm only works when the length of the input is even.")

  val evenOddPairs = cSeq.grouped(2).toSeq
  val evens = _fft(evenOddPairs map (_(0)), direction, scalar)
  val odds  = _fft(evenOddPairs map (_(1)), direction, scalar)

  def leftRightPair(k: Int): Pair[Complex, Complex] = {
    val base = evens(k) / scalar
    val offset = exp(direction * (Pi * k / n)) * odds(k) / scalar
    (base + offset, base - offset)
  }

  val pairs = (0 until n/2) map leftRightPair
  val left  = pairs map (_._1)
  val right = pairs map (_._2)
  left ++ right
}

def fft(cSeq: Seq[Complex]): Seq[Complex] = _fft(cSeq, Complex(0, 2), 1)
def rfft(cSeq: Seq[Complex]): Seq[Complex] = _fft(cSeq, Complex(0, -2), 2)
val data = Seq(Complex(1,0), Complex(1,0), Complex(1,0), Complex(1,0),
Complex(0,0), Complex(0,2), Complex(0,0), Complex(0,0))
println(fft(data))
Result
Vector(4.000 + 2.000i, 2.414 + 1.000i, -2.000, 2.414 + 1.828i, 2.000i, -0.414 + 1.000i, 2.000, -0.414 - 3.828i)
Does the input take left and right channel data in complex pairs? Does it return frequency intensity and phase offset? Is the time/frequency domain in the index?
The discrete Fourier transform does not have a notion of left and right channels. It takes a time domain signal as a complex valued sequence and transforms it to a frequency domain (spectral) representation of that signal. Most time domain signals are real valued so the imaginary part is zero.
The code above is a classic recursive implementation that returns the output in bit-reversed order as complex values. You need to convert the output to polar form and reorder the output array out of bit-reversed order to make it useful to you. This code, while elegant and educational, is slow, so I suggest you look for existing Java FFT libraries that suit your needs.
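To illustrate the polar-form step, here is a minimal sketch, assuming the Rosetta Code Complex case class with re and im fields:
import scala.math.{atan2, hypot}

// Convert one FFT output bin to (magnitude, phase).
def toPolar(c: Complex): (Double, Double) = {
  val magnitude = hypot(c.re, c.im) // intensity of this frequency bin
  val phase = atan2(c.im, c.re)     // phase offset in radians
  (magnitude, phase)
}

val spectrum = fft(data) map toPolar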
Fourier transforms are elegant but it is worth trying to understand how they work because they have subtle side effects that can really ruin your day.
Here is some imperative code:
var sum = 0
val spacing = 6
var x = spacing
for (i <- 1 to 10) {
  sum += x * x
  x += spacing
}
Here are two of my attempts to "functionalize" the above code:
// Attempt 1
(1 to 10).foldLeft((0, 6)) {
  case ((sum, x), _) => (sum + x * x, x + spacing)
}

// Attempt 2
Stream.iterate((0, 6)) { case (sum, x) => (sum + x * x, x + spacing) }.take(11).last
I think there might be a cleaner and better functional way to do this. What would that be?
PS: Please note that the above is just an example code intended to illustrate the problem; it is not from the real application code.
Replacing 10 by N, you have spacing * spacing * N * (N + 1) * (2 * N + 1) / 6
This is by noting that you're summing (spacing * i)^2 for the range 1..N. This sum factorizes as spacing^2 * (1^2 + 2^2 + ... + N^2), and the latter sum is well-known to be N * (N + 1) * (2 * N + 1) / 6 (see Square Pyramidal Number)
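A quick check of the closed form against a direct computation (spacing = 6, N = 10):
val spacing = 6
val N = 10
val closedForm = spacing * spacing * N * (N + 1) * (2 * N + 1) / 6 // 13860
val direct = (1 to N).map(i => spacing * i).map(x => x * x).sum    // 13860
assert(closedForm == direct)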
I actually like the idea of lazy sequences in this case. You can split your algorithm into two logical steps.
First you want to work on all natural numbers (OK... not all, but up to Int.MaxValue), so you define them like this:
val naturals = 0 to Int.MaxValue
Then you need to define how the numbers that you want to sum can be calculated:
val myDoubles = (naturals by 6 tail).view map (x => x * x)
And putting this all together:
val naturals = 0 to Int.MaxValue
val myDoubles = (naturals by 6 tail).view map (x => x * x)
val mySum = myDoubles take 10 sum
I think it's the way a mathematician would approach this problem. And because all collections are lazily evaluated, you will not run out of memory.
Edit
If you want to develop the idea of mathematical notation further, you can actually define this implicit conversion:
implicit def math[T, R](f: T => R) = new {
  def ∀(range: Traversable[T]) = range.view map f
}
and then define myDoubles like this:
val myDoubles = ((x: Int) => x * x) ∀ (naturals by 6 tail)
My personal favourite would have to be:
val x = (6 to 60 by 6) map {x => x*x} sum
Or given spacing as an input variable:
val x = (spacing to 10*spacing by spacing) map {x => x*x} sum
or
val x = (1 to 10) map (spacing*) map {x => x*x} sum
There are two different directions to go. If you want to express yourself, assuming that you can't use the built-in range function (because you actually want something more complicated):
Iterator.iterate(spacing)(x => x+spacing).take(10).map(x => x*x).foldLeft(0)(_ + _)
This is a very general pattern: specify what you start with and how to get the next given the previous; then take the number of items you need; then transform them somehow; then combine them into a single answer. There are shortcuts for almost all of these in simple cases (e.g. the last fold is sum) but this is a way to do it generally.
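For example, using those shortcuts, the same pipeline collapses to:
Iterator.iterate(spacing)(_ + spacing).take(10).map(x => x * x).sum // 13860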
But I also wonder: what is wrong with the mutable imperative approach for maximal speed? It's really quite clear, and Scala lets you mix the two styles on purpose:
var x = spacing
val last = spacing * 10
var sum = 0
while (x <= last) {
  sum += x * x
  x += spacing
}
(Note that the for is slower than while since the Scala compiler transforms for loops to a construct of maximum generality, not maximum speed.)
Here's a straightforward translation of the loop you wrote to a tail-recursive function, in an SML-like syntax.
val spacing = 6
fun loop (sum: int, x: int, i: int): int =
  if i > 0 then loop (sum+x*x, x+spacing, i-1)
  else sum
val sum = loop (0, spacing, 10)
Is this what you were looking for? (What do you mean by a "cleaner" and "better" way?)
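For reference, the same function rendered as Scala, as a direct translation of the SML above:
import scala.annotation.tailrec

val spacing = 6

@tailrec
def loop(sum: Int, x: Int, i: Int): Int =
  if (i > 0) loop(sum + x * x, x + spacing, i - 1)
  else sum

val sum = loop(0, spacing, 10) // 13860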
What about this?
def toSquare(i: Int) = i * i
val spacing = 6
val spaceMultiples = (1 to 10) map (spacing *)
val squares = spaceMultiples map toSquare
println(squares.sum)
You have to split your code into small parts. This can improve readability a lot.
Here is a one-liner:
(0 to 10).reduceLeft((u,v)=>u + spacing*spacing*v*v)
Note that you need to start with 0 in order to get the correct result; otherwise the first element would only seed the accumulator as-is, and would not be squared and scaled.
Another option is to generate the squares first (the running sums of the first k odd numbers are exactly the squares k²):
(1 to 2*10 by 2).scanLeft(0)(_+_).sum*spacing*spacing
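A quick sanity check that both one-liners agree with the closed-form value 36 * 385 = 13860:
val spacing = 6
val a = (0 to 10).reduceLeft((u, v) => u + spacing * spacing * v * v) // 13860
val b = (1 to 2 * 10 by 2).scanLeft(0)(_ + _).sum * spacing * spacing // 13860
assert(a == b && a == 13860)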