aggregating multiple values at once - scala

So I'm running into a speed issue where I have a dataset that needs to be aggregated multiple times.
Initially my team had set up three accumulators and was running a single foreach loop over the data, something along the lines of:
val accum1: Accumulable[A]
val accum2: Accumulable[B]
val accum3: Accumulable[C]

data.foreach { u =>
  accum1 += u
  accum2 += u
  accum3 += u
}
I am trying to switch these accumulations into an aggregation so that I can get a speed boost and have access to accumulators for debugging. I am currently trying to figure out a way to aggregate these three types at once, since running 3 separate aggregations is significantly slower. Does anyone have any thoughts as to how I can do this? Perhaps aggregating agnostically then pattern matching to split into two RDDs?
Thank you

As far as I can tell, all you need here is aggregate with zeroValue, seqOp and combOp corresponding to the operations performed by your accumulators.
val zeroValue: (A, B, C) = ??? // (accum1.zero, accum2.zero, accum3.zero)

def seqOp(r: (A, B, C), t: T): (A, B, C) = r match {
  case (a, b, c) => {
    // Apply operations equivalent to
    // accum1.addAccumulator(a, t)
    // accum2.addAccumulator(b, t)
    // accum3.addAccumulator(c, t)
    // and return the first argument
    // r
  }
}
def combOp(r1: (A, B, C), r2: (A, B, C)): (A, B, C) = (r1, r2) match {
  case ((a1, b1, c1), (a2, b2, c2)) => {
    // Apply operations equivalent to
    // accum1.addInPlace(a1, a2)
    // accum2.addInPlace(b1, b2)
    // accum3.addInPlace(c1, c2)
    // and return the first argument
    // r1
  }
}
val rdd: RDD[T] = ???
val accums: (A, B, C) = rdd.aggregate(zeroValue)(seqOp, combOp)
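For a concrete sense of the shape this takes, here is a minimal sketch assuming the three accumulators were tracking a record count, a sum of a numeric field, and a set of distinct keys (the Record type and these particular metrics are only an illustration, not from the question):
import org.apache.spark.rdd.RDD

// Hypothetical record type and metrics, purely for illustration.
case class Record(key: String, value: Long)

val records: RDD[Record] = ???

// zeroValue plays the role of the three accumulators' zero elements.
val zero: (Long, Long, Set[String]) = (0L, 0L, Set.empty[String])

// seqOp folds one record into the running triple (like += on each accumulator).
def seq(r: (Long, Long, Set[String]), t: Record): (Long, Long, Set[String]) =
  (r._1 + 1L, r._2 + t.value, r._3 + t.key)

// combOp merges partial results from two partitions (like addInPlace).
def comb(r1: (Long, Long, Set[String]), r2: (Long, Long, Set[String])): (Long, Long, Set[String]) =
  (r1._1 + r2._1, r1._2 + r2._2, r1._3 ++ r2._3)

val (count, sum, distinctKeys) = records.aggregate(zero)(seq, comb)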

Related

What can be the best approach - Get all possible sublists

I have a List "a, b, c, d", and this is the expected result:
a
ab
abc
abcd
b
bc
bcd
c
cd
d
I tried brute force, but I think there might be a more efficient solution, given that I may have a very long list.
Here's a one-liner.
"abcd".tails.flatMap(_.inits.toSeq.init.reverse).mkString(",")
//res0: String = a,ab,abc,abcd,b,bc,bcd,c,cd,d
The mkString() is added just so we can see the result. Otherwise the result is an Iterator[String], which is a pretty memory efficient collection type.
The reverse is only there so that it comes out in the order you specified. If the order of the results is unimportant then that can be removed.
The toSeq.init is there to remove empty elements left behind by the inits call. If those can be dealt with elsewhere then this can also be removed.
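To see what the individual pieces produce, here is a rough REPL-style illustration (results shown as comments; the exact rendering of the empty string may differ):
"abcd".tails.toList
// List(abcd, bcd, cd, d, "")   -- every suffix, including the empty one
"abcd".inits.toList
// List(abcd, abc, ab, a, "")   -- every prefix, including the empty one
"abcd".inits.toSeq.init.reverse.toList
// List(a, ab, abc, abcd)       -- drop the empty prefix and restore left-to-right order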
This may not be the best solution, but one way of doing this is by using the sliding function, as follows:
val lst = List('a', 'b', 'c', 'd')
val groupedElements = (1 to lst.size).flatMap(x => lst.sliding(x, 1))
groupedElements.foreach(x => println(x.mkString("")))
//output
/* a
b
c
d
ab
bc
cd
abc
bcd
abcd
*/
It may not be the best solution, but I think it is a good one, and it's tail-recursive.
First this function to get the possible sublists of a List
def getSubList[A](lista: Seq[A]): Seq[Seq[A]] = {
  for {
    i <- 1 to lista.length
  } yield lista.take(i)
}
And then this one to perform the recursion, calling the first function to obtain all the possible sublists:
import scala.annotation.tailrec

def getSubListRecursive[A](lista: Seq[A]): Seq[Seq[A]] = {
  @tailrec
  def go(acc: Seq[Seq[A]], rest: Seq[A]): Seq[Seq[A]] = {
    rest match {
      case Nil => acc
      case l   => go(acc = acc ++ getSubList(l), rest = l.tail)
    }
  }
  go(Nil, lista)
}
The output:
val l = List("a", "b", "c", "d")
getSubListRecursive(l)
res4: Seq[Seq[String]] = List(List(a), List(a, b), List(a, b, c), List(a, b, c, d), List(b), List(b, c), List(b, c, d), List(c), List(c, d), List(d))

map3 in Scala parallelism

def map2[A, B, C](a: Par[A], b: Par[B])(f: (A, B) => C): Par[C] =
  (es: ExecutorService) => {
    val af = a(es)
    val bf = b(es)
    UnitFuture(f(af.get, bf.get))
  }

def map3[A, B, C, D](pa: Par[A], pb: Par[B], pc: Par[C])(f: (A, B, C) => D): Par[D] =
  map2(map2(pa, pb)((a, b) => (c: C) => f(a, b, c)), pc)(_(_))
I have map2 and need to produce map3 in terms of map2. I found the solution on GitHub but it is hard to understand. Can anyone take a look at it and explain map3, and also what (_(_)) does?
On a purely abstract level, map2 means you can run two tasks in parallel, and that is a new task in itself. The implementation provided for map3 is: run in parallel (the task that consists of running the first two in parallel) and (the third task).
Now down to the code: first, let's give names to all the objects created (I also expanded the _ notation for clarity):
def map3[A, B, C, D](pa: Par[A], pb: Par[B], pc: Par[C])(f: (A, B, C) => D): Par[D] = {
  def partialCurry(a: A, b: B)(c: C): D = f(a, b, c)
  val pc2d: Par[C => D] = map2(pa, pb)((a, b) => partialCurry(a, b))
  def applyFunc(func: C => D, c: C): D = func(c)
  map2(pc2d, pc)((c2d, c) => applyFunc(c2d, c))
}
Now remember that map2 takes two Par[_], and a function to combine the eventual values, to get a Par[_] of the result.
The first time you use map2 (the inside one), you parallelize the first two tasks, and combine them into a function. Indeed, using f, if you have a value of type A and a value of type B, you just need a value of type C to build one of type D, so this exactly means that partialCurry(a, b) is a function of type C => D (partialCurry itself is of type (A, B) => C => D).
Now you have again two values of type Par[_], so you can again map2 on them, and there is only one natural way to combine them to get the final value.
The previous answer is correct, but I found it easier to think about it like this:
def map3[A, B, C, D](a: Par[A], b: Par[B], c: Par[C])(f: (A, B, C) => D): Par[D] = {
  val f1 = (a: A, b: B) => (c: C) => f(a, b, c)
  val f2: Par[C => D] = map2(a, b)(f1)
  map2(f2, c)((f3: C => D, c: C) => f3(c))
}
Create a function f1 that is a version of f with the first two arguments partially applied, then map2 that with a and b to give us a function of type C => D in the Par context (f2).
Finally we can use f2 and c as arguments to map2, then apply f3 (of type C => D) to c to give us a D in the Par context.
Hope this helps someone!
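The same trick scales to higher arities. As a sketch (assuming the same Par and map2 definitions as above), a map4 built purely out of map2 looks like this:
// Curry f one argument at a time, then peel the arguments off with map2's (_(_)).
def map4[A, B, C, D, E](pa: Par[A], pb: Par[B], pc: Par[C], pd: Par[D])(f: (A, B, C, D) => E): Par[E] = {
  val pcde: Par[C => D => E] = map2(pa, pb)((a, b) => (c: C) => (d: D) => f(a, b, c, d))
  val pde: Par[D => E] = map2(pcde, pc)(_(_))
  map2(pde, pd)(_(_))
}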

Future yielding with flatMap

Given Futures fa, fb, fc, I can use f: Function1[(A,B,C), Future[D]] to return a Future[D], either by:
(for {
a <- fa
b <- fb
c <- fc
} yield (a,b,c)).flatMap(f)
which has the unenviable property of declaring the variables a,b,c twice.
or
fa.zip(fb).zip(fc).flatMap { case ((a, b), c) => f(a, b, c) }
which is terser, but the nesting of the futures into pairs of pairs is weird.
It would be great to have a form of the for-expression where the yield returns a flattened result. Is there such a thing?
There's no reason to flatMap in the yield. It should be another line in the for-comprehension.
for {
  a <- fa
  b <- fb
  c <- fc
  d <- f(a, b, c)
} yield d
I don't think it can get more concise than that.
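As a minimal self-contained sketch (the futures and f below are hypothetical and f takes three arguments rather than a tuple, to match the answer's usage; note that fa, fb and fc are created before the comprehension, so they still run concurrently):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

// Hypothetical inputs and combining function.
val fa: Future[Int] = Future(1)
val fb: Future[String] = Future("two")
val fc: Future[Double] = Future(3.0)
def f(a: Int, b: String, c: Double): Future[String] = Future(s"$a $b $c")

val fd: Future[String] = for {
  a <- fa
  b <- fb
  c <- fc
  d <- f(a, b, c)
} yield d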

How to compute inverse of a multi-map

I have a Scala Map:
x: [b,c]
y: [b,d,e]
z: [d,f,g,h]
I want the inverse of this map for look-up:
b: [x,y]
c: [x]
d: [y,z] and so on.
Is there a way to do it without using in-between mutable maps?
If it's not a multi-map, then the following works:
typeMap.flatMap { case (k, v) => v.map(vv => (vv, k))}
EDIT: fixed answer to include what Marth rightfully pointed out. My answer is a bit more lengthy than his, as I try to go through each step rather than use the magic provided by flatMap, for educational purposes; his is more straightforward :)
I'm unsure about your notation. I assume that what you have is something like:
val myMap = Map[T, Set[T]](
  x -> Set(b, c),
  y -> Set(b, d, e),
  z -> Set(d, f, g, h)
)
You can achieve the reverse lookup as follows:
val instances = for {
  keyValue <- myMap.toList
  value <- keyValue._2
} yield (value, keyValue._1)
At this point, your instances variable is a List of the type:
(b, x), (c, x), (b, y) ...
If you now do:
val groupedLookups = instances.groupBy(_._1)
You get:
b -> ((b, x), (b, y)),
c -> ((c, x)),
d -> ((d, y), (d, z)) ...
Now we want to reduce the values so that they only contain the second part of each pair. Therefore we do:
val reverseLookup = groupedLookups.map(kv => kv._1 -> kv._2.map(_._2))
Which means that for every pair we maintain the original key, but we map the list of arguments to something that only has the second value of the pair.
And there you have your result.
(You can also avoid assigning to an intermediate result, but I thought it was clearer like this)
Here is my simplification as a function:
def reverseMultimap[T1, T2](map: Map[T1, Seq[T2]]): Map[T2, Seq[T1]] =
  map.toSeq
    .flatMap { case (k, vs) => vs.map((_, k)) }
    .groupBy(_._1)
    .mapValues(_.map(_._2))
The above was derived from @Diego Martinoia's answer, corrected and reproduced below in function form:
def reverseMultimap[T1, T2](myMap: Map[T1, Seq[T2]]): Map[T2, Seq[T1]] = {
  val instances = for {
    keyValue <- myMap.toList
    value <- keyValue._2
  } yield (value, keyValue._1)
  val groupedLookups = instances.groupBy(_._1)
  val reverseLookup = groupedLookups.map(kv => kv._1 -> kv._2.map(_._2))
  reverseLookup
}
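For example, applied to data shaped like the question's (a rough sketch; key and value ordering in the result is not guaranteed):
val typeMap: Map[String, Seq[String]] = Map(
  "x" -> Seq("b", "c"),
  "y" -> Seq("b", "d", "e"),
  "z" -> Seq("d", "f", "g", "h")
)

reverseMultimap(typeMap)
// roughly: Map(b -> Seq(x, y), c -> Seq(x), d -> Seq(y, z), e -> Seq(y),
//              f -> Seq(z), g -> Seq(z), h -> Seq(z))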

Functional equivalent of if (p(f(a), f(b))) a else b

I'm guessing that there must be a better functional way of expressing the following:
def foo(i: Any) : Int
if (foo(a) < foo(b)) a else b
So in this example f == foo and p == _ < _. There's bound to be some masterful cleverness in scalaz for this! I can see that using BooleanW I can write:
p(f(a), f(b)).option(a).getOrElse(b)
But I was sure that I would be able to write some code which only referred to a and b once. If this exists it must be on some combination of Function1W and something else but scalaz is a bit of a mystery to me!
EDIT: I guess what I'm asking here is not "how do I write this?" but "What is the correct name and signature for such a function and does it have anything to do with FP stuff I do not yet understand like Kleisli, Comonad etc?"
Just in case it's not in Scalaz:
def x[T, R](f: T => R)(p: (R, R) => Boolean)(x: T*) =
  x reduceLeft ((l, r) => if (p(f(l), f(r))) r else l)

scala> x(Math.pow(_: Int, 2))(_ < _)(-2, 0, 1)
res0: Int = -2
An alternative with some overhead but nicer syntax:
class MappedExpression[T, R](i: (T, T), m: (R, R)) {
  def select(p: (R, R) => Boolean) = if (p(m._1, m._2)) i._1 else i._2
}

class Expression[T](i: (T, T)) {
  def map[R](f: T => R) = new MappedExpression(i, (f(i._1), f(i._2)))
}

implicit def tupleTo[T](i: (T, T)) = new Expression(i)
scala> ("a", "bc") map (_.length) select (_ < _)
res0: java.lang.String = a
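In more recent Scala (2.10+) the same idea is usually packaged with an implicit class rather than an implicit def; a sketch of the equivalent (the names here are mine):
class MappedTuple[T, R](i: (T, T), m: (R, R)) {
  def select(p: (R, R) => Boolean): T = if (p(m._1, m._2)) i._1 else i._2
}

implicit class TupleExpression[T](i: (T, T)) {
  // Applies f to both elements and remembers the originals for select.
  def mapBoth[R](f: T => R): MappedTuple[T, R] = new MappedTuple(i, (f(i._1), f(i._2)))
}

("a", "bc") mapBoth (_.length) select (_ < _) // "a"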
I don't think that Arrows or any other special type of computation can be useful here. After all, you're calculating with normal values, and you can usually lift a pure computation into the special type of computation (using arr for arrows or return for monads).
However, one very simple arrow is arr a b, which is simply a function a -> b. You could then use arrows to split your code into more primitive operations. However, there is probably no reason for doing that, and it only makes your code more complicated.
You could for example lift the call to foo so that it is done separately from the comparison. Here is a simple definition of arrows in F# - it declares the *** and >>> arrow combinators and also arr for turning pure functions into arrows:
type Arr<'a, 'b> = Arr of ('a -> 'b)
let arr f = Arr f
let ( *** ) (Arr fa) (Arr fb) = Arr (fun (a, b) -> (fa a, fb b))
let ( >>> ) (Arr fa) (Arr fb) = Arr (fa >> fb)
Now you can write your code like this:
let calcFoo = arr <| fun a -> (a, foo a)
let compareVals = arr <| fun ((a, fa), (b, fb)) -> if fa < fb then a else b
(calcFoo *** calcFoo) >>> compareVals
The *** combinator takes two inputs and runs the first and second specified function on the first, respectively second argument. >>> then composes this arrow with the one that does comparison.
But as I said - there is probably no reason at all for writing this.
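Still, for readers following along in Scala, a rough translation of those three combinators (using plain functions as the arrows, with a hypothetical foo) might look like this:
// Plain Function1 standing in for the arrow type; a sketch only.
def arr[A, B](f: A => B): A => B = f

def ***[A, B, C, D](fa: A => C, fb: B => D): ((A, B)) => (C, D) = {
  case (a, b) => (fa(a), fb(b))
}

def >>>[A, B, C](fa: A => B, fb: B => C): A => C = fa andThen fb

// Hypothetical foo, as in the question.
def foo(i: Any): Int = i.toString.length

val calcFoo = arr((a: Any) => (a, foo(a)))
val compareVals: (((Any, Int), (Any, Int))) => Any = arr {
  case ((a, fa), (b, fb)) => if (fa < fb) a else b
}

val pick = >>>(***(calcFoo, calcFoo), compareVals)
// pick(("ab", "abcd")) == "ab"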
Here's the Arrow based solution, implemented with Scalaz. This requires trunk.
You don't get a huge win from using the arrow abstraction with plain old functions, but it is a good way to learn them before moving to Kleisli or Cokleisli arrows.
import scalaz._
import Scalaz._
def mod(n: Int)(x: Int) = x % n
def mod10 = mod(10) _
def first[A, B](pair: (A, B)): A = pair._1
def selectBy[A](p: (A, A))(f: (A, A) => Boolean): A = if (f.tupled(p)) p._1 else p._2
def selectByFirst[A, B](f: (A, A) => Boolean)(p: ((A, B), (A, B))): (A, B) =
selectBy(p)(f comap first) // comap adapts the input to f with function first.
val pair = (7, 16)
// Using the Function1 arrow to apply two functions to a single value, resulting in a Tuple2
((mod10 &&& identity) apply 16) assert_≟ (6, 16)
// Using the Function1 arrow to perform mod10 and identity respectively on the first and second element of a `Tuple2`.
val pairs = ((mod10 &&& identity) product) apply pair
pairs assert_≟ ((7, 7), (6, 16))
// Select the tuple with the smaller value in the first element.
selectByFirst[Int, Int](_ < _)(pairs)._2 assert_≟ 16
// Using the Function1 Arrow Category to compose the calculation of mod10 with the
// selection of desired element.
val calc = ((mod10 &&& identity) product) ⋙ selectByFirst[Int, Int](_ < _)
calc(pair)._2 assert_≟ 16
Well, I searched Hoogle for a type signature like the one in Thomas Jung's answer, and there is on. This is what I searched for:
(a -> b) -> (b -> b -> Bool) -> a -> a -> a
Where (a -> b) is the equivalent of foo, (b -> b -> Bool) is the equivalent of <. Unfortunately, the signature for on returns something else:
(b -> b -> c) -> (a -> b) -> a -> a -> c
This is almost the same, if you replace c with Bool and a in the two places it appears, respectively.
So, right now, I suspect it doesn't exist. It occurred to me that there's a more general type signature, so I tried it as well:
(a -> b) -> ([b] -> b) -> [a] -> a
This one yielded nothing.
EDIT:
Now I don't think I was that far at all. Consider, for instance, this:
Data.List.maximumBy (on compare length) ["abcd", "ab", "abc"]
The function maximumBy signature is (a -> a -> Ordering) -> [a] -> a, which, combined with on, is pretty close to what you originally specified, given that Ordering has three values -- almost a boolean! :-)
So, say you wrote on in Scala:
def on[A, B, C](f: ((B, B) => C), g: A => B): (A, A) => C = (a: A, b: A) => f(g(a), g(b))
Then you could write select like this:
def select[A](p: (A, A) => Boolean)(a: A, b: A) = if (p(a, b)) a else b
And use it like this:
select(on((_: Int) < (_: Int), (_: String).length))("a", "ab")
Which really works better with currying and dot-free notation. :-) But let's try it with implicits:
implicit def toFor[A, B](g: A => B) = new {
  def For[C](f: (B, B) => C) = (a1: A, a2: A) => f(g(a1), g(a2))
}

implicit def toSelect[A](t: (A, A)) = new {
  def select(p: (A, A) => Boolean) = t match {
    case (a, b) => if (p(a, b)) a else b
  }
}
Then you can write
("a", "ab") select (((_: String).length) For (_ < _))
Very close. I haven't figured out any way to remove the type qualifier from there, though I suspect it is possible. I mean, without going the way of Thomas's answer. But maybe that is the way. In fact, I think on (_.length) select (_ < _) reads better than map (_.length) select (_ < _).
This expression can be written very elegantly in the Factor programming language - a language where function composition is the way of doing things, and most code is written in a point-free manner. The stack semantics and row polymorphism facilitate this style of programming. This is what the solution to your problem looks like in Factor:
# We find the longer of two lists here. The expression returns { 4 5 6 7 8 }
{ 1 2 3 } { 4 5 6 7 8 } [ [ length ] bi# > ] 2keep ?
# We find the shorter of two lists here. The expression returns { 1 2 3 }.
{ 1 2 3 } { 4 5 6 7 8 } [ [ length ] bi# < ] 2keep ?
Of interest here is the combinator 2keep. It is a "preserving dataflow combinator", which means that it retains its inputs after the given function is performed on them.
Let's try to translate (sort of) this solution to Scala.
First of all, we define an arity-2 preserving combinator.
scala> def keep2[A, B, C](f: (A, B) => C)(a: A, b: B) = (f(a, b), a, b)
keep2: [A, B, C](f: (A, B) => C)(a: A, b: B)(C, A, B)
And an eagerIf combinator. Since if is a control structure, it cannot be used in function composition; hence this construct.
scala> def eagerIf[A](cond: Boolean, x: A, y: A) = if(cond) x else y
eagerIf: [A](cond: Boolean, x: A, y: A)A
Also, the on combinator. Since it clashes with a method with the same name from Scalaz, I'll name it upon instead.
scala> class RichFunction2[A, B, C](f: (A, B) => C) {
| def upon[D](g: D => A)(implicit eq: A =:= B) = (x: D, y: D) => f(g(x), g(y))
| }
defined class RichFunction2
scala> implicit def enrichFunction2[A, B, C](f: (A, B) => C) = new RichFunction2(f)
enrichFunction2: [A, B, C](f: (A, B) => C)RichFunction2[A,B,C]
And now put this machinery to use!
scala> def length: List[Int] => Int = _.length
length: List[Int] => Int
scala> def smaller: (Int, Int) => Boolean = _ < _
smaller: (Int, Int) => Boolean
scala> keep2(smaller upon length)(List(1, 2), List(3, 4, 5)) |> Function.tupled(eagerIf)
res139: List[Int] = List(1, 2)
scala> def greater: (Int, Int) => Boolean = _ > _
greater: (Int, Int) => Boolean
scala> keep2(greater upon length)(List(1, 2), List(3, 4, 5)) |> Function.tupled(eagerIf)
res140: List[Int] = List(3, 4, 5)
This approach does not look particularly elegant in Scala, but at least it shows you one more way of doing things.
There's a nice-ish way of doing this with on and Monad, but Scala is unfortunately very bad at point-free programming. Your question is basically: "can I reduce the number of points in this program?"
Imagine if on and if were differently curried and tupled:
def on2[A, B, C](f: A => B)(g: (B, B) => C): ((A, A)) => C = {
  case (a, b) => f.on(g, a, b)
}

def if2[A](b: Boolean): ((A, A)) => A = {
  case (p, q) => if (b) p else q
}
Then you could use the reader monad:
on2(f)(_ < _) >>= if2
The Haskell equivalent would be:
on' (<) f >>= if'
  where on' f g = uncurry $ on f g
        if' x (y,z) = if x then y else z
Or...
flip =<< flip =<< (if' .) . on (<) f
  where if' x y z = if x then y else z