Continuing a running number in Scala

Feel free to edit the title of this post.
Given these:
case class Foo(amount: Int, running: Int)
val seed = List(Foo(10, 10), Foo(5, 15), Foo(10, 25))
val next = List(20, 10, 15)
How do I map next into List(Foo(20, 45), Foo(10, 55), Foo(15, 70)) the Scala way? As you can see it continues the running number.

Here are a couple of different approaches, both based on scanLeft:
val initial = seed.map(_.amount).sum
next
  .zip(next.scanLeft(initial)(_ + _).tail) // scanLeft emits the seed first, so drop it
  .map((Foo.apply _).tupled)
and
val initial = Foo(0, seed.map(_.amount).sum)
next
  .scanLeft(initial) {
    case (Foo(_, total), n) =>
      Foo(n, total + n)
  }
  .tail
You might also consider solving it using recursion.
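For the recursive route, here is a minimal sketch. The names RunningTotal and continueRunning are made up for illustration, and it assumes the running total to continue from is the running field of the last seed element (0 if the seed is empty).

```scala
import scala.annotation.tailrec

object RunningTotal {
  case class Foo(amount: Int, running: Int)

  // Continue the running total of `seed` across the amounts in `next`.
  // Assumption: an empty seed means the running total starts at 0.
  def continueRunning(seed: List[Foo], next: List[Int]): List[Foo] = {
    @tailrec
    def loop(remaining: List[Int], total: Int, acc: List[Foo]): List[Foo] =
      remaining match {
        case Nil       => acc.reverse // acc was built back to front
        case n :: rest => loop(rest, total + n, Foo(n, total + n) :: acc)
      }

    loop(next, seed.lastOption.map(_.running).getOrElse(0), Nil)
  }
}
```

Prepending to the accumulator and reversing once at the end keeps the whole thing linear in the length of next.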

Related

Spark groupBy X then sortBy Y then get topK

case class Tomato(name:String, rank:Int)
case class Potato(..)
I have Spark 2.4 and a Dataset[(Tomato, Potato)] that I want to groupBy name and then get the topK ranks from each group.
Issue is that groupBy produces an iterator which is not sortable and iterator.toList explodes on large datasets.
Iterator solution:
data.groupByKey{ case (tomato,_) => tomato.name }
  .flatMapGroups((k,it)=>it.toList.sortBy(_._1.rank).take(topK))
I've also tried aggregation functions but I could not find a topK or firstK only first and last.
Another thing I hate about aggregation functions is that they convert the dataset to a dataframe (yuck) so all the types are gone.
Aggregation Fn solution syntax made up by me:
data.agg(row_number.over(Window.partitionBy("_1.name").orderBy("_1.rank").take(topK))
There are already several questions on SO that ask for groupBy followed by some other operation, but none of them want to sort by a key different from the groupBy key and then get topK.
You could go the iterator route without having to create a full list which indeed explodes with big datasets. Something like:
import spark.implicits._
import scala.util.Sorting
case class Tomato(name:String, rank:Int)
case class Potato(taste: String)
case class MyClass(tomato: Tomato, potato: Potato)
val ordering = Ordering.by[MyClass, Int](_.tomato.rank)
val ds = Seq(
(MyClass(Tomato("tomato1", 1), Potato("tasty"))),
(MyClass(Tomato("tomato1", 2), Potato("tastier"))),
(MyClass(Tomato("tomato2", 2), Potato("tastiest"))),
(MyClass(Tomato("tomato3", 2), Potato("yum"))),
(MyClass(Tomato("tomato3", 4), Potato("yummier"))),
(MyClass(Tomato("tomato3", 50), Potato("yummiest"))),
(MyClass(Tomato("tomato7", 50), Potato("yam")))
).toDS
val k = 2
val output = ds
  .groupByKey {
    case MyClass(tomato, potato) => tomato.name
  }
  .mapGroups(
    (name, iterator) => {
      val topK = iterator.foldLeft(Seq.empty[MyClass]) {
        (accumulator, element) => {
          val newAccumulator = accumulator :+ element
          if (newAccumulator.length > k)
            newAccumulator.sorted(ordering).drop(1)
          else
            newAccumulator
        }
      }
      (name, topK)
    }
  )
output.show(false)
+-------+--------------------------------------------------------+
|_1 |_2 |
+-------+--------------------------------------------------------+
|tomato7|[[[tomato7, 50], [yam]]] |
|tomato2|[[[tomato2, 2], [tastiest]]] |
|tomato1|[[[tomato1, 1], [tasty]], [[tomato1, 2], [tastier]]] |
|tomato3|[[[tomato3, 4], [yummier]], [[tomato3, 50], [yummiest]]]|
+-------+--------------------------------------------------------+
So as you see, for each Tomato.name key, we're keeping the k elements with the largest Tomato.rank values. You get a Dataset[(String, Seq[MyClass])] as the result.
This is not really optimized for performance: for each group, we're iterating over all of its elements and sorting the sequence which could become quite intensive computationally. But this all depends on the size of your actual case classes, the size of your data, your requirements, ...
Hope this helps!
Issue is that groupBy produces an iterator which is not sortable and iterator.toList explodes on large datasets.
What you could do is come up with a topK() method that takes parameters k, an Iterator[A] and an A => B mapping, and returns an Iterator[A] of the k elements with the smallest B values, in ascending order (flip the comparisons if you want the largest) -- all without having to sort the entire iterator:
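def topK[A, B : Ordering](k: Int, iter: Iterator[A], f: A => B): Iterator[A] = {
  val orderer = implicitly[Ordering[B]]
  import orderer._
  // Seed with the first k elements. This relies on iter resuming right after
  // the k elements consumed by take(k) when it is reused in the fold below.
  val listK = iter.take(k).toList
  // Keep the list sorted in descending order so its head is the largest kept value.
  iter.foldLeft(listK.sortWith(f(_) > f(_))){ (lsK, x) =>
    if (f(x) < f(lsK.head))
      (x :: lsK.tail).sortWith(f(_) > f(_))
    else
      lsK
  }.reverse.iterator
}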
Note that topK() only involves iterative sorting of lists of size k, with the assumption k is small compared with the size of the input iterator. If necessary, it could be further optimized to eliminate the sorting of the k-elements lists by only making its first element the largest element while leaving the rest of the lists unsorted.
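One way to avoid re-sorting the k-element list is a bounded max-heap. The following is only a sketch of that idea using scala.collection.mutable.PriorityQueue, not the answer's method; TopKSketch and smallestK are hypothetical names, and like topK() above it keeps the k smallest values.

```scala
import scala.collection.mutable

object TopKSketch {
  // Keep the k elements with the smallest f-values; result is in ascending order.
  def smallestK[A, B](k: Int, iter: Iterator[A])(f: A => B)(implicit ord: Ordering[B]): List[A] =
    if (k <= 0) Nil
    else {
      // Max-heap keyed on f: heap.head is always the largest of the kept elements.
      val heap = mutable.PriorityQueue.empty[A](Ordering.by[A, B](f)(ord))
      iter.foreach { x =>
        if (heap.size < k) heap.enqueue(x)
        else if (ord.lt(f(x), f(heap.head))) {
          heap.dequeue() // evict the current largest
          heap.enqueue(x)
        }
      }
      // dequeue() yields largest first, so prepending produces ascending order.
      var result = List.empty[A]
      while (heap.nonEmpty) result = heap.dequeue() :: result
      result
    }
}
```

Each element costs at most O(log k) here, and the iterator is traversed exactly once. Inside flatMapGroups you would presumably call it as TopKSketch.smallestK(k, it)(_._1.rank), since a List is accepted where a TraversableOnce is expected.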
Using your groupByKey approach, method topK() can be plugged in within flatMapGroups as shown below:
case class T(name: String, rank: Int)
case class P(name: String, rank: Int)
val ds = Seq(
(T("t1", 4), P("p1", 1)),
(T("t1", 5), P("p2", 2)),
(T("t1", 1), P("p3", 3)),
(T("t1", 3), P("p4", 4)),
(T("t1", 2), P("p5", 5)),
(T("t2", 4), P("p6", 6)),
(T("t2", 2), P("p7", 7)),
(T("t2", 6), P("p8", 8))
).toDF("tomato", "potato").as[(T, P)]
val k = 3
ds.
groupByKey{ case (tomato, _) => tomato.name }.
flatMapGroups((_, it) => topK[(T, P), Int](k, it, { case (t, p) => t.rank })).
show
/*
+-------+-------+
| _1| _2|
+-------+-------+
|{t1, 1}|{p3, 3}|
|{t1, 2}|{p5, 5}|
|{t1, 3}|{p4, 4}|
|{t2, 2}|{p7, 7}|
|{t2, 4}|{p6, 6}|
|{t2, 6}|{p8, 8}|
+-------+-------+
*/

Best way to handle while loop style program in Scala

I have a simple program written in C++. It generates random numbers and stops when the sum of those numbers is equal to or greater than 100. The code looks like:
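#include <cstdlib>
#include <numeric>
#include <vector>
int main() {
    std::vector<int> container;
    // std::vector has no sum(), so add up the elements with std::accumulate
    while (std::accumulate(container.begin(), container.end(), 0) < 100)
    {
        int new_number = std::rand() % 10 + 1; // generate a number in range 1 to 10
        container.push_back(new_number);       // add new number to the container
    }
}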
What is the best way to handle this task in Scala (without using a while loop)?
It seems that foldLeft and foldRight have no way to break out when a condition is met?
Create an infinite Stream of random numbers (requires very little CPU and memory), take only what you need, then turn the result Stream into the desired collection type.
val randoms = Stream.continually(util.Random.nextInt(10)+1)
val container = randoms.take(randoms.scan(0)(_+_).indexWhere(_>=100)).toVector
Added bonus is that the sums are calculated as you go, i.e. added to the previous sum, not summing from the beginning each time.
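Note that Stream is deprecated as of Scala 2.13 in favor of LazyList, and the one-liner above walks the memoized stream twice. As a rough single-pass alternative that should work on both 2.12 and 2.13, here is a sketch built on Iterator; RandomUntil and drawUntil are names invented for this example.

```scala
import scala.util.Random

object RandomUntil {
  // Draw numbers in 1..10 until their running sum reaches `target`.
  def drawUntil(target: Int, rng: Random): Vector[Int] =
    Iterator
      .continually(rng.nextInt(10) + 1)
      // Pair each number with the running sum that includes it.
      .scanLeft((0, 0)) { case ((_, sum), n) => (n, sum + n) }
      .drop(1) // drop the (0, 0) seed emitted by scanLeft
      // Keep a number as long as the sum *before* it was still below target,
      // so the number that crosses the target is included.
      .takeWhile { case (n, sum) => sum - n < target }
      .map(_._1)
      .toVector
}
```

Passing the Random in makes the result reproducible when you seed it, e.g. RandomUntil.drawUntil(100, new Random(42)).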
Here's one approach that replaces the while loop with a tail-recursive function:
@scala.annotation.tailrec
def addToContainer(container: Vector[Int], max: Int): Vector[Int] = {
  val newContainer = container ++ Vector(scala.util.Random.nextInt(10) + 1)
  // Returns the container from *before* the number that reaches max, so the sum stays below max
  if (newContainer.sum >= max) container
  else addToContainer(newContainer, max)
}
addToContainer(Vector[Int](), 100)
// res1: Vector[Int] = Vector(9, 9, 5, 9, 3, 5, 2, 5, 10, 7, 6, 4, 5, 5, 9, 3)
res1.sum
// res2: Int = 96
Here's one way to do it:
val randomNumberGenerator = new scala.util.Random
def sumUntil(list: List[Int]): List[Int] = list match {
  case exceeds if list.filter(_ > 0).sum > 100 => list
  case _ => sumUntil(list :+ (randomNumberGenerator.nextInt(10) + 1))
}
To explain the solution:
Create an instance of scala.util.Random which will help us generate random numbers
sumUntil will pattern match; if the sum of the list exceeds 100, return it.
In the event that the sum does not exceed 100, call sumUntil again, but with another random number between 1 and 10 (inclusive) appended. Keep in mind that the _ means, "I don't care about the value, or even the type." Here, _ matches anything other than the case where the sum of all the integers in our list is greater than 100.
If you're new to Scala, I understand that it may be a bit rough on the eyes to read. Below is a refined version:
val randomNumberGenerator = new scala.util.Random
def sumUntil(list: List[Int]): List[Int] = list match {
  case exceeds if sumList(list) > 100 => list
  case _ => sumUntil(appendRandomNumberToList(list))
}
private def sumList(list: List[Int]): Int = {
  list.filter(_ > 0).sum
}
private def appendRandomNumberToList(list: List[Int]): List[Int] = {
  list :+ (randomNumberGenerator.nextInt(10) + 1)
}
If your loop just scans through the collection, use fold or reduce.
If it needs some custom terminate condition, recursion is favored.
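To make the recursion option concrete, here is a small sketch that threads the running sum through the recursion instead of re-summing the collection on every step. FillUntil and fill are illustrative names only, and unlike some of the answers above it stops once the sum is equal to or greater than the target, matching the C++ loop.

```scala
import scala.annotation.tailrec
import scala.util.Random

object FillUntil {
  // Append random numbers in 1..10 until the running sum is >= target.
  def fill(target: Int, rng: Random): Vector[Int] = {
    @tailrec
    def loop(acc: Vector[Int], sum: Int): Vector[Int] =
      if (sum >= target) acc // custom termination condition
      else {
        val n = rng.nextInt(10) + 1
        loop(acc :+ n, sum + n) // carry the sum along instead of recomputing it
      }

    loop(Vector.empty, 0)
  }
}
```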

Functional way to map over a list with an accumulator in Scala

I would like to write succinct code to map over a list, accumulating a value as I go and using that value in the output list.
Using a recursive function and pattern matching this is straightforward (see below). But I was wondering if there is a way to do this using the functional programming family of combinators like map and fold etc. Obviously map and fold are no good unless you use a mutable variable defined outside the call and modify that in the body.
Perhaps I could do this with a State Monad but was wondering if there is a way to do it that I'm missing, and that utilizes the Scala standard library.
// accumulate(List(10, 20, 20, 30, 20))
// => List(10, 30, 50, 80, 100)
def accumulate(weights: List[Int], sum: Int = 0, acc: List[Int] = List.empty): List[Int] = {
  weights match {
    case hd :: tl =>
      val total = hd + sum
      accumulate(tl, total, total :: acc)
    case Nil =>
      acc.reverse
  }
}
You may also use foldLeft:
def accumulate(seq: Seq[Int]) =
  seq.foldLeft(Vector.empty[Int]) { (result, e) =>
    result :+ (result.lastOption.getOrElse(0) + e)
  }
accumulate(List(10, 20, 20, 30, 20))
// => Vector(10, 30, 50, 80, 100)
This could be done with scan:
val result = list.scanLeft(0){case (acc, item) => acc+item}
Scan will include the initial value 0 into output so you have to drop it:
result.drop(1)
As pointed out in @Nyavro's answer, the operation you are looking for (the sum of the prefixes of the list) is called prefix-sum and its generalization to any binary operation is called scan and is included in the Scala standard library:
val l = List(10, 20, 20, 30, 20)
l.scan(0) { _ + _ }
//=> List(0, 10, 30, 50, 80, 100)
l.scan(0)(_ + _).drop(1)
//=> List(10, 30, 50, 80, 100)
This has already been answered, but I wanted to address a misconception in your question:
Obviously map and fold are no good unless you use a mutable variable defined outside the call and modify that in the body.
That is not true. fold is a general method of iteration. Everything you can do by iterating over a collection, you can do with fold. If fold were the only method in your List class, you could still do everything you can do now. Here's how to solve your problem with fold:
l.foldLeft(List(0)) { (list, el) => list.head + el :: list }.reverse.drop(1)
And a general implementation of scan (like the library version, it includes the initial value z in the result):
def scan[A](l: List[A])(z: A)(op: (A, A) => A): List[A] =
  l.
    foldLeft(List(z)) { (list, el) => op(list.head, el) :: list }.
    reverse
Think of it this way: a collection can be either empty or not. fold has two arguments, one which tells it what to do when the list is empty, and one which tells it what to do when the list is not empty. Those are the only two cases, so every possible case is handled. Therefore, fold can do anything! (More precisely in Scala, foldLeft and foldRight can do anything, while fold is restricted to associative operations.)
Or a different viewpoint: a collection is a stream of instructions, either the EMPTY instruction or the ELEMENT(value) instruction. foldLeft / foldRight are skeleton interpreters for that instruction set, and you as a programmer can supply the implementation for the interpretation of both those instructions, namely the two arguments to foldLeft / foldRight are the interpretation of those instructions.
Remember: while foldLeft / foldRight reduces a collection to a single value, that value can be arbitrarily complex, including being a collection itself!
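To illustrate that last point, here is a small sketch in which the fold's result is itself a collection, or a tuple carrying one. FoldSketch and its method names are made up for this example; the standard library's own map, filter and scanLeft are what you would normally use.

```scala
object FoldSketch {
  // map expressed as a fold: the accumulator is the output list.
  def mapViaFold[A, B](xs: List[A])(f: A => B): List[B] =
    xs.foldRight(List.empty[B])((x, acc) => f(x) :: acc)

  // filter expressed as a fold.
  def filterViaFold[A](xs: List[A])(p: A => Boolean): List[A] =
    xs.foldRight(List.empty[A])((x, acc) => if (p(x)) x :: acc else acc)

  // Prefix sums with a tuple accumulator: (running total, totals so far, newest first).
  def prefixSums(xs: List[Int]): List[Int] =
    xs.foldLeft((0, List.empty[Int])) { case ((sum, out), x) =>
      (sum + x, (sum + x) :: out)
    }._2.reverse
}
```

The tuple accumulator in prefixSums plays the role of the "mutable variable defined outside the call" from the question, except that nothing is mutated.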

Split sorted Scala Sequence/Array according to gaps between elements [duplicate]

What is the most elegant way of grouping a list of values into groups based on their neighbor values?
The wider context I have is having a list of lines, that need to be grouped into paragraphs. I want to be able to say that if the vertical difference between two lines is lower than threshold, they are in the same paragraph.
I ended up solving this problem differently, but I'm wondering about the correct solution here.
case class Box(y: Int)
val list = List(Box(y=1), Box(y=2), Box(y=5))
def group(list: List[Box], threshold: Int): List[List[Box]] = ???
val grouped = group(list, 2)
> List(List(Box(y=1), Box(y=2)), List(Box(y=5)))
I have looked at groupBy(), but that can only work with one element at a time. I have also tried an approach that involved pre-computing differences using sliding(), but then it becomes awkward to retrieve the elements from the original collection.
It's a one liner. Generalising types left as an exercise for the reader.
Using ints and absolute difference rather than lines and spacing to avoid clutter.
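val zs = List(1,2,4,8,9,10,15,16)
def closeEnough(a:Int, b:Int) = (Math.abs(b -a) <= 2)
// The function argument has to start on the same line as foldLeft(...),
// otherwise the line break ends the expression
zs.drop(1).foldLeft(List(List(zs.head))) { (acc, e) =>
  if (closeEnough(e, acc.head.head))
    (e::acc.head)::acc.tail
  else
    List(e)::acc
}.map(_.reverse).reverse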
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))
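Or a two-liner for a slight efficiency gain: fold over the reversed list, so the groups come out in the original order without the extra reverse calls.
val ys = zs.reverse
ys.drop(1).foldLeft(List(List(ys.head))) { (acc, e) =>
  if (closeEnough(e, acc.head.head))
    (e::acc.head)::acc.tail
  else
    List(e)::acc
}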
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))
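As for the exercise of generalising the types, one possible sketch follows. GapGrouping and groupAdjacent are hypothetical names, and it assumes "lower than threshold" means a strict < comparison on the y difference, which matches the expected output in the question. Unlike the snippets above, it also handles an empty list.

```scala
object GapGrouping {
  case class Box(y: Int)

  // Start a new group whenever sameGroup(previous, current) is false.
  def groupAdjacent[A](xs: List[A])(sameGroup: (A, A) => Boolean): List[List[A]] =
    xs.foldLeft(List.empty[List[A]]) {
      // The newest group is at the head, with its newest element first.
      case ((last :: group) :: done, x) if sameGroup(last, x) =>
        (x :: last :: group) :: done
      case (groups, x) =>
        List(x) :: groups
    }.map(_.reverse).reverse

  // The signature from the question, expressed through the generic helper.
  def group(list: List[Box], threshold: Int): List[List[Box]] =
    groupAdjacent(list)((a, b) => math.abs(b.y - a.y) < threshold)
}
```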
