Scala map with dependent variables

In Scala I have a list of functions that return a value. The order in which the functions are executed is important, since the argument of function n is the output of function n-1.
This hints at using foldLeft, something like:
val base: A
val funcs: Seq[Function[A, A]]
funcs.foldLeft(base)((x, f) => f(x))
(detail: type A is actually a Spark DataFrame).
However, the results of the individual functions are mutually exclusive, and in the end I want the union of the results of all functions.
This hints at using map, something like:
funcs.map(f => f(base)).reduce(_.union(_))
But here each function is applied to base which is not what I want.
In short: an ordered list of functions of arbitrary length needs to return a list of return values of equal length, where each value n-1 is the input of function n (starting from base at n=0), such that the result values can be concatenated.
How can I achieve this?
EDIT
example:
case class X(id:Int, value:Int)
val base = spark.createDataset(Seq(X(1, 1), X(2, 2), X(3, 3), X(4, 4), X(5, 5))).toDF
def toA = (x: DataFrame) => x.filter('value.mod(2) === 1).withColumn("value", lit("a"))
def toB = (x: DataFrame) => x.withColumn("value", lit("b"))
val a = toA(base)
val remainder = base.join(a, Seq("id"), "leftanti")
val b = toB(remainder)
a.union(b)
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 3| a|
| 5| a|
| 2| b|
| 4| b|
+---+-----+
This should work for an arbitrary number of functions (e.g. toA, toB, ..., toN), where each time the remainder of the previous result is calculated and passed into the next function. In the end a union is applied to all results.

Seq already has a method scanLeft that does this out-of-the-box:
funcs.scanLeft(base)((acc, f) => f(acc)).tail
Make sure to drop the first element of the result of scanLeft if you don't want base to be included.
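For example, with plain Ints (the same setup as the fold example further down):
val base = 7
val funcs: List[Int => Int] = List(_ * 2, _ + 3)
funcs.scanLeft(base)((acc, f) => f(acc))      // List(7, 14, 17) -- base included
funcs.scanLeft(base)((acc, f) => f(acc)).tail // List(14, 17)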
It is also possible using only foldLeft:
funcs.foldLeft((base, List.empty[A])) { case ((x, list), f) =>
  val res = f(x)
  (res, res :: list)
}._2.reverse.reduce(_.union(_))
Or:
funcs.foldLeft((base, Vector.empty[A])) { case ((x, list), f) =>
  val res = f(x)
  (res, list :+ res)
}._2.reduce(_.union(_))
The trick is to accumulate into a Seq inside the fold.
Example:
scala> val base = 7
base: Int = 7
scala> val funcs: List[Int => Int] = List(_ * 2, _ + 3)
funcs: List[Int => Int] = List($$Lambda$1772/1298658703@7d46af18, $$Lambda$1773/107346281@5470fb9b)
scala> funcs.foldLeft((base, Vector.empty[Int])){ case ((x, list), f) =>
     |   val res = f(x)
     |   (res, list :+ res)
     | }._2
res8: scala.collection.immutable.Vector[Int] = Vector(14, 17)
scala> res8.reduce(_ + _)
res9: Int = 31
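Note that in the DataFrame case from the question's EDIT, each function should consume the remainder of the previous step rather than its direct output. A sketch of the same accumulate-in-a-fold trick adapted for that, assuming every result keeps the id column so the leftanti join from the question applies:
import org.apache.spark.sql.DataFrame

val (_, results) = funcs.foldLeft((base, Vector.empty[DataFrame])) {
  case ((remaining, acc), f) =>
    val res = f(remaining)                                     // apply f to what is left
    val remainder = remaining.join(res, Seq("id"), "leftanti") // rows f did not consume
    (remainder, acc :+ res)
}
results.reduce(_.union(_))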

I've got a simplified solution using normal collections but the same principle applies.
def times2(l: List[Int]): List[Int] = l.map(_ * 2)
def by2(l: List[Int]): List[Int] = l.map(_ / 2)

val list: List[Int] = List(1, 2, 3, 4, 5)
val funcs: Seq[Function[List[Int], List[Int]]] = Seq(times2, by2)
funcs.foldLeft(list) { case (collection, func) => func(collection) } foreach println // prints 1 2 3 4 5
This solution does not hold if you want a single reduced value as your final output, e.g. a single Int; it works as
F[B] -> F[B] -> F[B] and not as F[B] -> F[B] -> B, though I guess this is what you need.

Related

How to reduce multiple case when in scala-spark

Newbie question, how do you optimize/reduce expressions like these:
when(x1._1,x1._2).when(x2._1,x2._2).when(x3._1,x3._2).when(x4._1,x4._2).when(x5._1,x5._2)....
.when(xX._1,xX._2).otherwise(z)
The x1, x2, x3, ..., xX are pairs where x._1 is the condition and x._2 is the "then" value.
I was trying to save them in a list and then use a map-reduce, but it was producing:
when(x1._1,x1._2).otherwise(z) && when(x2._1,x2._2).otherwise(z)...
Which is wrong. I have about 10 lines of pure when cases and would like to reduce them so my code is clearer.
You can use foldLeft on the maplist:
val maplist = List(x1, x2) // add more x if needed
val new_col = maplist.tail
  .foldLeft(when(maplist.head._1, maplist.head._2))((x, y) => x.when(y._1, y._2))
  .otherwise(z)
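For example, with hypothetical (Column, String) pairs standing in for the x maps:
import org.apache.spark.sql.functions.{col, lit, when}

val x1 = (col("age") < 13, "child")   // hypothetical condition/value pairs
val x2 = (col("age") < 20, "teen")
val maplist = List(x1, x2)
val z = lit("adult")

val new_col = maplist.tail
  .foldLeft(when(maplist.head._1, maplist.head._2))((x, y) => x.when(y._1, y._2))
  .otherwise(z)
// CASE WHEN (age < 13) THEN child WHEN (age < 20) THEN teen ELSE adult END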
An alternative is to use coalesce. If the condition is not met, null is returned by the when statement, and the next when statement will be evaluated until a non-null result is obtained.
val new_col = coalesce((maplist.map(x => when(x._1, x._2)) :+ z):_*)
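With the same hypothetical maplist and z as above (z must already be a Column here, e.g. a lit), this builds the equivalent fallback chain:
import org.apache.spark.sql.functions.coalesce

val new_col2 = coalesce((maplist.map(x => when(x._1, x._2)) :+ z): _*)
// each when(...) without otherwise yields null when its condition fails,
// so coalesce picks the first matching branch and falls back to z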
You could create a simple recursive method to assemble the nested-when/otherwise condition:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

def nestedCond(cols: Array[String], default: String): Column = {
  def loop(ls: List[String]): Column = ls match {
    case Nil => col(default)
    case c :: tail => when(col(s"$c._1"), col(s"$c._2")).otherwise(loop(tail))
  }
  loop(cols.toList).as("nested-cond")
}
Testing the method:
val df = Seq(
  ((false, 1), (false, 2), (true, 3), 88),
  ((false, 4), (true, 5), (true, 6), 99)
).toDF("x1", "x2", "x3", "z")
val cols = df.columns.filter(_.startsWith("x"))
// cols: Array[String] = Array(x1, x2, x3)
df.select(nestedCond(cols, "z")).show
// +-----------+
// |nested-cond|
// +-----------+
// | 3|
// | 5|
// +-----------+
Alternatively, use foldRight to assemble the nested-condition:
def nestedCond(cols: Array[String], default: String): Column =
  cols.foldRight(col(default)) { (c, acc) =>
    when(col(s"$c._1"), col(s"$c._2")).otherwise(acc)
  }.as("nested-cond")
Another way is to pass the otherwise value as the initial value of foldLeft:
val maplist = Seq(Map(col("c1") -> "value1"), Map(col("c2") -> "value2"))
val newCol = maplist.flatMap(_.toSeq).foldLeft(lit("z")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}
// gives:
// newCol: org.apache.spark.sql.Column = CASE WHEN c2 THEN value2 ELSE CASE WHEN c1 THEN value1 ELSE z END END

Scala conditional fold in Tuple(int,int)

I have a list of tuples (int,int) such as
(100,3), (130,3), (160,1), (180,2), (200,2)
I want to foldRight or do something similarly efficient where neighbors are compared. For ((A1,A2),(B1,B2)), we merge only when A2 is less than or equal to B2; otherwise we do not fold the list at that position. If we merge, we retain (A1,A2) and add a count field.
The sample output is
(100,3,2) and (160,1,3)
here 2 and 3 are the weights of the observations folded into each resulting observation.
(100,3), (130,3)
will lead to (100,3,2)
while
(160,1), (180,2), (200,2)
will lead to (160,1,3)
Any idea how to write this in functional Scala style?
scala> def conditionalFold(in: List[(Int, Int)]): List[(Int, Int, Int)] =
     |   in.foldLeft(Nil: List[(Int, Int, Int)]) { (acc, i) =>
     |     acc match {
     |       case Nil =>
     |         (i._1, i._2, 1) :: Nil
     |       case head :: tail =>
     |         if (i._2 >= head._2)
     |           (head._1, head._2, head._3 + 1) :: tail
     |         else
     |           (i._1, i._2, 1) :: head :: tail
     |     }
     |   }.reverse
conditionalFold: (in: List[(Int, Int)])List[(Int, Int, Int)]
scala> println(conditionalFold(List((100, 3), (130, 3), (160, 1), (180, 2), (200, 2))))
List((100,3,2), (160,1,3))
scala> println(conditionalFold(List((100,3), (130,3))))
List((100,3,2))
scala> println(conditionalFold(List((160,1), (180,2), (200,2))))
List((160,1,3))

How to process cogroup values?

I am cogrouping two RDDs and I want to process their values. That is,
rdd1.cogroup(rdd2)
as a result of this cogrouping I get results as below:
(ion,(CompactBuffer(100772C121, 100772C111, 6666666666),CompactBuffer(100772C121)))
Considering this result, I would like to obtain all distinct pairs, e.g.
For the key 'ion'
100772C121 - 100772C111
100772C121 - 6666666666
100772C111 - 6666666666
How can I do this in Scala?
You could try something like the following:
(l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
You would need to substitute your CompactBuffer contents for l1 and l2. When I tried this locally, I got the following (which is what I believe you want):
scala> val l1 = List("100772C121", "100772C111", "6666666666")
l1: List[String] = List(100772C121, 100772C111, 6666666666)
scala> val l2 = List("100772C121")
l2: List[String] = List(100772C121)
scala> val combine = (l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
combine: List[(String, String)] = List((100772C121,100772C111), (100772C121,6666666666), (100772C111,6666666666))
If you would like all of these pairs on separate rows, you can enclose this logic within a flatMap.
EDIT: Added steps per your example above.
scala> val rdd1 = sc.parallelize(Array(("ion", "100772C121"), ("ion", "100772C111"), ("ion", "6666666666")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(Array(("ion", "100772C121")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
     |   case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
     |     (l1.toSeq ++ l2.toSeq).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
     | }
cgrp: org.apache.spark.rdd.RDD[(String, String)] = FlatMappedRDD[4] at flatMap at <console>:16
scala> cgrp.foreach(println)
...
(100772C121,100772C111)
(100772C121,6666666666)
(100772C111,6666666666)
EDIT 2: Updated again per your use case.
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
     |   case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
     |     for { e1 <- l1.toSeq; e2 <- l2.toSeq; if (e1 != e2) }
     |       yield if (e1 > e2) ((e1, e2), 1) else ((e2, e1), 1)
     | }.reduceByKey(_ + _)
...
((6666666666,100772C121),2)
((6666666666,100772C111),1)
((100772C121,100772C111),1)

Does Scala have a statement equivalent to ML's "as" construct?

In ML, one can assign names for each element of a matched pattern:
fun findPair n nil = NONE
  | findPair n ((head as (n1, _))::rest) =
      if n = n1 then (SOME head) else (findPair n rest)
In this code, I defined an alias for the first pair of the list and matched the contents of the pair. Is there an equivalent construct in Scala?
You can do variable binding with the @ symbol, e.g.:
scala> val wholeList @ List(x, _*) = List(1,2,3)
wholeList: List[Int] = List(1, 2, 3)
x: Int = 1
I'm sure you'll get a more complete answer later as I'm not sure how to write it recursively like your example, but maybe this variation would work for you:
scala> val pairs = List((1, "a"), (2, "b"), (3, "c"))
pairs: List[(Int, String)] = List((1,a), (2,b), (3,c))
scala> val n = 2
n: Int = 2
scala> pairs find {e => e._1 == n}
res0: Option[(Int, String)] = Some((2,b))
OK, next attempt at direct translation. How about this?
scala> def findPair[A, B](n: A, p: List[Tuple2[A, B]]): Option[Tuple2[A, B]] = p match {
     |   case Nil => None
     |   case head :: rest if head._1 == n => Some(head)
     |   case _ :: rest => findPair(n, rest)
     | }
findPair: [A, B](n: A, p: List[(A, B)])Option[(A, B)]
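Combining the recursion with @ binding gives a closer translation of ML's as construct (a sketch; findPairAs is a hypothetical name with the same semantics as findPair above):
def findPairAs[A, B](n: A, p: List[(A, B)]): Option[(A, B)] = p match {
  case Nil => None
  case (head @ (n1, _)) :: rest =>   // head aliases the whole pair, like ML's `as`
    if (n1 == n) Some(head) else findPairAs(n, rest)
}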

What's the relation of fold on Option, Either etc and fold on Traversable?

Scalaz provides a method named fold for various ADTs such as Boolean, Option[_], Validation[_, _], Either[_, _], etc. This method basically takes one function per possible case of the given ADT. In other words, the pattern match shown below:
x match {
  case Case1(a, b, c) => f(a, b, c)
  case Case2(a, b) => g(a, b)
  .
  .
  case CaseN => z
}
is equivalent to:
x.fold(f, g, ..., z)
Some examples:
scala> (9 == 8).fold("foo", "bar")
res0: java.lang.String = bar
scala> 5.some.fold(2 *, 2)
res1: Int = 10
scala> 5.left[String].fold(2 +, "[" +)
res2: Any = 7
scala> 5.fail[String].fold(2 +, "[" +)
res6: Any = 7
At the same time, there is an operation with the same name on the Traversable[_] types, which traverses the collection, performing a certain operation on its elements and accumulating the result. For example:
scala> List(2, 90, 11).foldLeft("Contents: ")(_ + _.toString + " ")
res9: java.lang.String = "Contents: 2 90 11 "
scala> List(2, 90, 11).fold(0)(_ + _)
res10: Int = 103
scala> List(2, 90, 11).fold(1)(_ * _)
res11: Int = 1980
Why are these two operations identified with the same name - fold/catamorphism? I fail to see any similarities/relation between the two. What am I missing?
I think the problem you are having is that you see these things based on their implementation, not their types. Consider this simple representation of types:
List[A]   = Nil
          | Cons head: A tail: List[A]
Option[A] = None
          | Some el: A
Now, let's consider Option's fold:
fold[B] = (noneCase: => B, someCase: A => B) => B
So, on Option, it reduces every possible case to some value in B and returns that. Now, let's see the same thing for List:
fold[B] = (nilCase: => B, consCase: (A, List[A]) => B) => B
Note, however, that we have a recursive call there, on List[A]. We have to fold that somehow, but we know fold[B] on a List[A] will always return B, so we can rewrite it like this:
fold[B] = (nilCase: => B, consCase: (A, B) => B) => B
In other words, we replaced List[A] by B, because folding it will always return a B, given the type signature of fold. Now, let's see Scala's (use case) type signature for foldRight:
foldRight[B](z: B)(f: (A, B) ⇒ B): B
Say, does that remind you of something?
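To make the parallel concrete, here is a sketch of both catamorphisms written out in plain Scala (foldOption and foldList are hypothetical helper names):
// One handler per constructor; no recursion in Option's shape.
def foldOption[A, B](o: Option[A])(noneCase: => B)(someCase: A => B): B = o match {
  case None    => noneCase
  case Some(a) => someCase(a)
}

// The recursive List[A] position is folded down to a B, which is exactly foldRight.
def foldList[A, B](l: List[A])(nilCase: B)(consCase: (A, B) => B): B = l match {
  case Nil          => nilCase
  case head :: tail => consCase(head, foldList(tail)(nilCase)(consCase))
}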
If you think of "folding" as "condensing all the values in a container through an operation, with a seed value", and you think of an Option as a container that can have at most one value, then this starts to make sense.
In fact, foldLeft has the same signature and gives you exactly the same results if you use it on an empty list vs None, and on a list with only one element vs Some:
scala> val opt : Option[Int] = Some(10)
opt: Option[Int] = Some(10)
scala> val lst : List[Int] = List(10)
lst: List[Int] = List(10)
scala> opt.foldLeft(1)((a, b) => a + b)
res11: Int = 11
scala> lst.foldLeft(1)((a, b) => a + b)
res12: Int = 11
fold with this signature is also usable on both List and Option in the Scala standard library (for Option it goes through the implicit conversion to Iterable rather than a shared trait). And again, you get the same results on a singleton list as on Some:
scala> opt.fold(1)((a, b) => a * b)
res25: Int = 10
scala> lst.fold(1)((a, b) => a * b)
res26: Int = 10
I'm not 100% sure about the fold from Scalaz on Option/Either/etc, you raise a good point there. It seems to have quite a different signature and operation from the "folding" I'm used to.
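For what it's worth, the fold later added to the standard library's Option (since Scala 2.10) has exactly the catamorphism shape discussed above:
// Option.fold[B](ifEmpty: => B)(f: A => B): B
Some(10).fold(0)(_ * 2)             // 20
(None: Option[Int]).fold(0)(_ * 2)  // 0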