Faster implementation for reduceByKey on Seq of pairs possible? - scala

The code below contains various single-threaded implementations of reduceByKeyXXX methods and a few helper methods to create input data and measure execution times. (Feel free to run the main method.)
The main purpose of reduceByKey (as in Spark) is to reduce key-value-pairs with the same key. Example:
scala> val xs = Seq( "a" -> 2, "b" -> 3, "a" -> 5)
xs: Seq[(String, Int)] = List((a,2), (b,3), (a,5))
scala> ReduceByKeyComparison.reduceByKey(xs, (x:Int, y:Int) ⇒ x+y )
res8: Seq[(String, Int)] = ArrayBuffer((b,3), (a,7))
Code
import java.util.HashMap

object Util {
  def measure( body : => Unit ) : Long = {
    val now = System.currentTimeMillis
    body
    val nowAfter = System.currentTimeMillis
    nowAfter - now
  }

  def measureMultiple( body: => Unit, n: Int) : String = {
    val executionTimes = (1 to n).toList.map( x => {
      print(".")
      measure(body)
    } )
    val avg = executionTimes.sum / executionTimes.size
    executionTimes.mkString("", "ms, ", "ms") + s" Average: ${avg}ms."
  }
}

object RandomUtil {
  val AB = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  val r = new java.util.Random();

  def randomString( len: Int ) : String = {
    val sb = new StringBuilder( len );
    for( i <- 0 to len-1 ) {
      sb.append(AB.charAt(r.nextInt(AB.length())));
    }
    sb.toString();
  }

  def generateSeq(n: Int) : Seq[(String, Int)] = {
    Seq.fill(n)( (randomString(2), r.nextInt(100)) )
  }
}

object ReduceByKeyComparison {

  def main(args: Array[String]) : Unit = {
    implicit def iterableToPairedIterable[K, V](x: Iterable[(K, V)]) = { new PairedIterable(x) }

    val runs = 10
    val problemSize = 2000000
    val ss = RandomUtil.generateSeq(problemSize)

    println("ReduceByKey : " + Util.measureMultiple( reduceByKey(ss, (x:Int, y:Int) ⇒ x+y ), runs ))
    println("ReduceByKey2: " + Util.measureMultiple( reduceByKey2(ss, (x:Int, y:Int) ⇒ x+y ), runs ))
    println("ReduceByKey3: " + Util.measureMultiple( reduceByKey3(ss, (x:Int, y:Int) ⇒ x+y ), runs ))
    println("ReduceByKeyPaired: " + Util.measureMultiple( ss.reduceByKey( (x:Int, y:Int) ⇒ x+y ), runs ))
    println("ReduceByKeyA: " + Util.measureMultiple( reduceByKeyA( ss, (x:Int, y:Int) ⇒ x+y ), runs ))
  }

  // =============================================================================
  // Different implementations
  // =============================================================================

  def reduceByKey[A,B]( s: Seq[(A,B)], fnc: (B, B) ⇒ B) : Seq[(A,B)] = {
    val t = s.groupBy(x => x._1)
    val u = t.map { case (k,v) => (k, v.map(_._2).reduce(fnc))}
    u.toSeq
  }

  def reduceByKey2[A,B]( s: Seq[(A,B)], fnc: (B, B) ⇒ B) : Seq[(A,B)] = {
    val r = s.foldLeft( Map[A,B]() ){ (m,a) ⇒
      val k = a._1
      val v = a._2
      m.get(k) match {
        case Some(pv) ⇒ m + ((k, fnc(pv, v)))
        case None ⇒ m + ((k, v))
      }
    }
    r.toSeq
  }

  def reduceByKey3[A,B]( s: Seq[(A,B)], fnc: (B, B) ⇒ B) : Seq[(A,B)] = {
    var m = scala.collection.mutable.Map[A,B]()
    s.foreach{ e ⇒
      val k = e._1
      val v = e._2
      m.get(k) match {
        case Some(pv) ⇒ m(k) = fnc(pv, v)
        case None ⇒ m(k) = v
      }
    }
    m.toSeq
  }

  /**
   * Method code from [[http://ideone.com/dyrkYM]]
   * All rights to Muhammad-Ali A'rabi according to [[https://issues.scala-lang.org/browse/SI-9064]]
   */
  def reduceByKeyA[A,B]( s: Seq[(A,B)], fnc: (B, B) ⇒ B): Map[A, B] = {
    s.groupBy(_._1).map(l => (l._1, l._2.map(_._2).reduce( fnc )))
  }

  /**
   * Method code from [[http://ideone.com/dyrkYM]]
   * All rights to Muhammad-Ali A'rabi according to [[https://issues.scala-lang.org/browse/SI-9064]]
   */
  class PairedIterable[K, V](x: Iterable[(K, V)]) {
    def reduceByKey(func: (V,V) => V) = {
      val map = new HashMap[K, V]
      x.foreach { pair =>
        val old = map.get(pair._1)
        map.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
      }
      map
    }
  }
}
yielding the following results on my machine
..........ReduceByKey : 723ms, 782ms, 761ms, 617ms, 640ms, 707ms, 634ms, 611ms, 380ms, 458ms Average: 631ms.
..........ReduceByKey2: 580ms, 458ms, 452ms, 463ms, 462ms, 470ms, 463ms, 465ms, 458ms, 462ms Average: 473ms.
..........ReduceByKey3: 489ms, 466ms, 461ms, 468ms, 555ms, 474ms, 469ms, 457ms, 461ms, 468ms Average: 476ms.
..........ReduceByKeyPaired: 140ms, 124ms, 124ms, 120ms, 122ms, 124ms, 118ms, 126ms, 121ms, 119ms Average: 123ms.
..........ReduceByKeyA: 628ms, 694ms, 666ms, 656ms, 616ms, 660ms, 594ms, 659ms, 445ms, 399ms Average: 601ms.
with ReduceByKeyPaired currently being the fastest.
Question / Task
Is there a faster single-threaded (Scala) implementation?

Rewriting the reduceByKey method of PairedIterable to use recursion gives around a 5-10% performance improvement.
That is all I was able to get.
I also tried increasing the initial capacity allocation of the HashMap, but it did not show any significant change.
import scala.annotation.tailrec

class PairedIterable[K, V](x: Iterable[(K, V)]) {
  def reduceByKey(func: (V, V) => V) = {
    val map = new HashMap[K, V]()

    @tailrec
    def reduce(it: List[(K, V)]): HashMap[K, V] = it match {
      case Nil => map
      case (k, v) :: tail =>
        val old = map.get(k)
        map.put(k, if (old == null) v else func(old, v))
        reduce(tail)
    }

    reduce(x.toList)
  }
}
In general, comparing the provided methods, they can be split into two categories.
The first set of reductions works by grouping (sorting) first; as we can see, those methods add extra O(n log n) complexity and are not effective for this scenario.
The second set loops linearly over all entries of the Iterable. These methods need extra get/put operations on a temporary map, but those gets/puts are not very time consuming: roughly O(n) overall, with constant cost per operation.
Moreover, the need to work with Options in the Scala collections makes them less effective.
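For illustration only (this is not one of the measured implementations above): the same linear-loop idea can also be expressed with java.util.HashMap.merge, which avoids allocating an Option and the explicit null check. Whether it actually beats PairedIterable would need benchmarking; this is a sketch assuming Java 8+ for merge and Scala 2.12+ for the SAM conversion, and the class name is my own.

import java.util.HashMap
import java.util.function.BiFunction

// Sketch of the "linear loop" category using HashMap.merge:
// merge inserts the value if the key is absent, otherwise combines it
// with the existing value via the supplied function.
class MergingPairedIterable[K, V](x: Iterable[(K, V)]) {
  def reduceByKey(func: (V, V) => V): HashMap[K, V] = {
    val map = new HashMap[K, V]()
    val combine: BiFunction[V, V, V] = (a: V, b: V) => func(a, b)
    x.foreach { pair => map.merge(pair._1, pair._2, combine) }
    map
  }
}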

Related

How to create an Akka Stream Source that generates items recursively

I'm trying to figure out how to create an Akka Streams Source that generates many Seq[Int].
Basically, given an Int n, I want to generate all of the permutations of the Seq[Int] 1 to n.
Here's some code that does this:
def combinations(n: Int): Seq[Seq[Int]] = {
def loop(acc: (Seq[Int], Seq[Seq[Int]]),
remaining: Seq[Int]): Seq[Seq[Int]] = {
remaining match {
case s if s.size == 1 => {
val total: Seq[Seq[Int]] = acc._2
val current: Seq[Int] = acc._1
total :+ (current :+ s.head)
}
case _ => {
for {
x <- remaining
comb <- loop((acc._1 :+ x, acc._2), remaining.filter(_ != x))
} yield comb
}
}
}
loop((Seq(), Seq()), (1 to n))
}
This works fine up to 10... then it blows up because it runs out of memory. Since I just want to process each of them and don't need to keep them all in memory, I thought... Akka Streams. But I'm at a loss for how to turn this into a Source that produces each combination so I can process them. Basically, where the code appends to total, I would instead emit another item onto the stream.
Here is a solution that uses the Johnson-Trotter algorithm for permutations. tcopermutations creates a LazyList that can be evaluated as needed. For more permutations, just pass a different value to printNIterations.
The reason for using the Johnson-Trotter algorithm is that it breaks the recursive structure of the permutation finding algorithm. That's important for being able to evaluate successive instances of the permutation and storing them in some kind of lazy list or stream.
object PermutationsTest {
def main(args: Array[String]) = {
printNIterations(50, tcopermutations(5).iterator)
}
def printNIterations(n: Int, it: Iterator[Seq[Int]]): Unit = {
if (n<=0) ()
else {
if (it.hasNext) {
println(it.next())
printNIterations(n - 1, it)
} else ()
}
}
def naivepermutations(n: Int): Seq[Seq[Int]] = {
def loop(acc: Seq[Int], remaining: Seq[Int]): Seq[Seq[Int]] = {
remaining match {
case s if s.size == 1 => {
val current: Seq[Int] = acc
Seq((current :+ s.head))
}
case _ => {
for {
x <- remaining
comb <- loop(acc :+ x, remaining.filter(_ != x))
} yield comb
}
}
}
loop(Seq(), (1 to n))
}
def tcopermutations(n: Int): LazyList[Seq[Int]] = {
val start = (1 to n).map(Element(_, Left))
def loop(v: Seq[Element]): LazyList[Seq[Element]] = {
johnsonTrotter(v) match {
case Some(s) => v #:: loop(s)
case None => LazyList(v)
}
}
loop(start).map(_.map(_.i))
}
def checkIfMobile(seq: Seq[Element], i: Int): Boolean = {
val e = seq(i)
def getAdjacent(s: Seq[Element], d: Direction, j: Int): Int = {
val adjacentIndex = d match {
case Left => j - 1
case Right => j + 1
}
s(adjacentIndex).i
}
if (e.direction == Left && i == 0) false
else if (e.direction == Right && i == seq.size - 1) false
else if (getAdjacent(seq, e.direction, i) < e.i) true
else false
}
def findLargestMobile(seq: Seq[Element]): Option[Int] = {
val mobiles = (0 until seq.size).filter{j => checkIfMobile(seq, j)}
if (mobiles.isEmpty) None
else {
val folded = mobiles.map(x=>(x,seq(x).i)).foldLeft(None: Option[(Int, Int)]){ case (acc, elem) =>
acc match {
case None => Some(elem)
case Some((i, value)) => if (value > elem._2) Some((i, value)) else Some(elem)
}
}
folded.map(_._1)
}
}
def swapLargestMobile(seq: Seq[Element], index: Int): (Seq[Element], Int) = {
val dir = seq(index).direction
val value = seq(index).i
dir match {
case Right =>
val folded = seq.foldLeft((None, Seq()): (Option[Element], Seq[Element])){(acc, elem) =>
val matched = elem.i == value
val newAccOpt = if (matched) Some(elem) else None
val newAccSeq = acc._1 match {
case Some(swapMe) => acc._2 :+ elem :+ swapMe
case None => if (matched) acc._2 else acc._2 :+ elem
}
(newAccOpt, newAccSeq)
}
(folded._2, index + 1)
case Left =>
val folded = seq.foldRight((None, Seq()): (Option[Element], Seq[Element])){(elem, acc) =>
val matched = elem.i == value
val newAccOpt = if (matched) Some(elem) else None
val newAccSeq = acc._1 match {
case Some(swapMe) => swapMe +: elem +: acc._2
case None => if (matched) acc._2 else elem +: acc._2
}
(newAccOpt, newAccSeq)
}
(folded._2, index - 1)
}
}
def revDirLargerThanMobile(seq: Seq[Element], mobile: Int) = {
def reverse(e: Element) = {
e.direction match {
case Left => Element(e.i, Right)
case Right => Element(e.i, Left)
}
}
seq.map{ elem =>
if (elem.i > seq(mobile).i) reverse(elem)
else elem
}
}
def johnsonTrotter(curr: Seq[Element]): Option[Seq[Element]] = {
findLargestMobile(curr).map { m =>
val (swapped, newMobile) = swapLargestMobile(curr, m)
revDirLargerThanMobile(swapped, newMobile)
}
}
trait Direction
case object Left extends Direction
case object Right extends Direction
case class Element(i: Int, direction: Direction)
}
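Since the original question asked for an Akka Streams Source, note that the LazyList returned by tcopermutations can be handed to Source directly, because LazyList is an immutable.Iterable. A minimal sketch, assuming Akka 2.6.x and that the PermutationsTest object above is on the classpath (the demo object and system name here are my own):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object PermutationsSourceDemo extends App {
  implicit val system: ActorSystem = ActorSystem("perms")
  import system.dispatcher

  // Elements are computed as the stream pulls them (keep in mind that a
  // LazyList memoizes whatever has already been evaluated).
  Source(PermutationsTest.tcopermutations(4))
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())
}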

How to group large stream into sub streams

I want to group a large Stream[F, A] into a Stream[Stream[F, A]] with at most n elements in each inner stream.
This is what I did: basically, pipe chunks into a Queue[F, Queue[F, Chunk[A]]], and then yield the queue elements as the result stream.
implicit class StreamSyntax[F[_], A](s: Stream[F, A])(
implicit F: Concurrent[F]) {
def groupedPipe(
lastQRef: Ref[F, Queue[F, Option[Chunk[A]]]],
n: Int): Pipe[F, A, Stream[F, A]] = { in =>
val initQs =
Queue.unbounded[F, Option[Queue[F, Option[Chunk[A]]]]].flatMap { qq =>
Queue.bounded[F, Option[Chunk[A]]](1).flatMap { q =>
lastQRef.set(q) *> qq.enqueue1(Some(q)).as(qq -> q)
}
}
Stream.eval(initQs).flatMap {
case (qq, initQ) =>
def newQueue = Queue.bounded[F, Option[Chunk[A]]](1).flatMap { q =>
qq.enqueue1(Some(q)) *> lastQRef.set(q).as(q)
}
val evalStream = {
in.chunks
.evalMapAccumulate((0, initQ)) {
case ((i, q), c) if i + c.size >= n =>
val (l, r) = c.splitAt(n - i)
q.enqueue1(Some(l)) >> q.enqueue1(None) >> q
.enqueue1(None) >> newQueue.flatMap { nq =>
nq.enqueue1(Some(r)).as(((r.size, nq), c))
}
case ((i, q), c) if (i + c.size) < n =>
q.enqueue1(Some(c)).as(((i + c.size, q), c))
}
.attempt ++ Stream.eval {
lastQRef.get.flatMap { last =>
last.enqueue1(None) *> last.enqueue1(None)
} *> qq.enqueue1(None)
}
}
qq.dequeue.unNoneTerminate
.map(
q =>
q.dequeue.unNoneTerminate
.flatMap(Stream.chunk)
.onFinalize(
q.dequeueChunk(Int.MaxValue).unNoneTerminate.compile.drain))
.concurrently(evalStream)
}
}
def grouped(n: Int) = {
Stream.eval {
Queue.unbounded[F, Option[Chunk[A]]].flatMap { empty =>
Ref.of[F, Queue[F, Option[Chunk[A]]]](empty)
}
}.flatMap { ref =>
val p = groupedPipe(ref, n)
s.through(p)
}
}
}
But it is very complicated; is there any simpler way?
fs2 has chunkN and chunkLimit methods that can help with grouping:
stream.chunkN(n).map(Stream.chunk)
stream.chunkLimit(n).map(Stream.chunk)
chunkN produces chunks of size n until the end of a stream
chunkLimit splits existing chunks and can produce chunks with variable size.
scala> Stream(1,2,3).repeat.chunkN(2).take(5).toList
res0: List[Chunk[Int]] = List(Chunk(1, 2), Chunk(3, 1), Chunk(2, 3), Chunk(1, 2), Chunk(3, 1))
scala> (Stream(1) ++ Stream(2, 3) ++ Stream(4, 5, 6)).chunkLimit(2).toList
res0: List[Chunk[Int]] = List(Chunk(1), Chunk(2, 3), Chunk(4, 5), Chunk(6))
In addition to the already mentioned chunkN, also consider using groupWithin (fs2 1.0.1):
def groupWithin[F2[x] >: F[x]](n: Int, d: FiniteDuration)(implicit timer: Timer[F2], F: Concurrent[F2]): Stream[F2, Chunk[O]]
Divides this stream into groups of elements received within a time window, or limited by the number of elements, whichever happens first. Empty groups, which can occur if no elements can be pulled from upstream in a given time window, will not be emitted.
Note: a time window starts each time downstream pulls.
I'm not sure why you'd want this to be nested streams, since the requirement is to have "at most n elements" in one batch - which implies that you're keeping track of a finite number of elements (which is exactly what a Chunk is for). Either way, a Chunk can always be represented as a Stream with Stream.chunk:
val chunks: Stream[F, Chunk[O]] = ???
val streamOfStreams: Stream[F, Stream[F, O]] = chunks.map(Stream.chunk)
Here's a complete example of how to use groupWithin:
import cats.implicits._
import cats.effect.{ExitCode, IO, IOApp}
import fs2._
import scala.concurrent.duration._
object GroupingDemo extends IOApp {
override def run(args: List[String]): IO[ExitCode] = {
Stream('a, 'b, 'c).covary[IO]
.groupWithin(2, 1.second)
.map(_.toList)
.showLinesStdOut
.compile.drain
.as(ExitCode.Success)
}
}
Outputs:
List('a, 'b)
List('c)
Finally, I used a more reliable version (using Hotswap to ensure queue termination), like this:
def grouped(
innerSize: Int
)(implicit F: Async[F]): Stream[F, Stream[F, A]] = {
type InnerQueue = Queue[F, Option[Chunk[A]]]
type OuterQueue = Queue[F, Option[InnerQueue]]
def swapperInner(swapper: Hotswap[F, InnerQueue], outer: OuterQueue) = {
val innerRes =
Resource.make(Queue.unbounded[F, Option[Chunk[A]]])(_.offer(None))
swapper.swap(innerRes).flatTap(q => outer.offer(q.some))
}
def loopChunk(
gathered: Int,
curr: Queue[F, Option[Chunk[A]]],
chunk: Chunk[A],
newInnerQueue: F[InnerQueue]
): F[(Int, Queue[F, Option[Chunk[A]]])] = {
if (gathered + chunk.size > innerSize) {
val (left, right) = chunk.splitAt(innerSize - gathered)
curr.offer(left.some) >> newInnerQueue.flatMap { nq =>
loopChunk(0, nq, right, newInnerQueue)
}
} else if (gathered + chunk.size == innerSize) {
curr.offer(chunk.some) >> newInnerQueue.tupleLeft(
0
)
} else {
curr.offer(chunk.some).as(gathered + chunk.size -> curr)
}
}
val prepare = for {
outer <- Resource.eval(Queue.unbounded[F, Option[InnerQueue]])
swapper <- Hotswap.create[F, InnerQueue]
} yield outer -> swapper
Stream.resource(prepare).flatMap {
case (outer, swapper) =>
val newInner = swapperInner(swapper, outer)
val background = Stream.eval(newInner).flatMap { initQueue =>
s.chunks
.filter(_.nonEmpty)
.evalMapAccumulate(0 -> initQueue) { (state, chunk) =>
val (gathered, curr) = state
loopChunk(gathered, curr, chunk, newInner).tupleRight({})
}
.onFinalize(swapper.clear *> outer.offer(None))
}
val foreground = Stream
.fromQueueNoneTerminated(outer)
.map(i => Stream.fromQueueNoneTerminatedChunk(i))
foreground.concurrently(background)
}
}
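For reference, here is a hypothetical usage sketch of the grouped extension above, assuming it is exposed as syntax on Stream[IO, Int] (as in the implicit class earlier in the question) and that cats-effect 3 / fs2 3 are on the classpath; the object name is my own:

import cats.effect.{IO, IOApp}
import fs2.Stream

object GroupedDemo extends IOApp.Simple {
  def run: IO[Unit] =
    Stream.range(0, 10).covary[IO]
      .grouped(3)                              // Stream[IO, Stream[IO, Int]], at most 3 per inner stream
      .evalMap(inner => inner.compile.toList.flatMap(l => IO.println(l)))
      .compile
      .drain
}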

scala Map with Option/Some gives match error

The following code produces a runtime error, shown below. What is the reason for the error? Please explain.
Exception in thread "main" scala.MatchError: Some(Some(List(17))) (of class scala.Some)
at com.discrete.CountingSupp$.$anonfun$tuplesWithRestrictions1$1(CountingSupp.scala:43)
def tuplesWithRestrictions1(): (Int, Map[Int, Option[List[Int]]]) = {
val df = new DecimalFormat("#")
df.setMaximumFractionDigits(0)
val result = ((0 until 1000) foldLeft[(Int, Map[Int, Option[List[Int]]])] ((0, Map.empty[Int, Option[List[Int]]]))) {
(r: (Int, Map[Int, Option[List[Int]]]), x: Int) => {
val str = df.format(x).toCharArray
if (str.contains('7')) {
import scala.math._
val v = floor(log10(x)) - 1
val v1 = (pow(10, v)).toInt
val m: Map[Int, Option[List[Int]]] = (r._2).get(v1) match {
case None => r._2 + (v1 -> Some(List(x)))
case Some(xs: List[Int]) => r._2 updated(x, Some(x::xs))
}
val f = (r._1 + 1, m)
f
} else r
}
}
result
}
The return type of .get on a Map is
get(k: K): Option[V]
Scala doc
/** Optionally returns the value associated with a key.
 *
 *  @param  key the key value
 *  @return an option value containing the value associated with `key` in this map,
 *          or `None` if none exists.
 */
def get(key: K): Option[V]
Now, r._2.get(v1) returns an Option of the value, so the final return type is Option[Option[List[Int]]].
You are trying to pattern match for Option[T], but the real value type is Option[Option[List[Int]]], which is not captured in the match.
Use r._2(v1) to extract the value and match on it; this throws an exception when v1 is not found in the map.
Or match inside map, providing a default value:
r._2.get(v1).map {
  case None     => r._2 + (v1 -> Some(List(x)))
  case Some(xs) => r._2.updated(v1, Some(x :: xs))
}.getOrElse(r._2 + (v1 -> Some(List(x))))
Here is a version with the map's value type changed to List[Int], which avoids the nested Option altogether:
def tuplesWithRestrictions1(): (Int, Map[Int, List[Int]]) = {
val df = new DecimalFormat("#")
df.setMaximumFractionDigits(0)
val result = ((0 until 1000) foldLeft[(Int, Map[Int, List[Int]])] ((0, Map.empty[Int, List[Int]]))) {
(r: (Int, Map[Int, List[Int]]), x: Int) => {
val str = df.format(x).toCharArray
if (str.contains('7')) {
import scala.math._
val v = floor(log10(x))
val v1 = (pow(10, v)).toInt
val m: Map[Int, List[Int]] = (r._2).get(v1) match {
case Some(xs: List[Int]) => r._2 updated(v1, x :: xs)
case None => r._2 + (v1 -> List(x))
}
val f = (r._1 + 1, m)
f
} else r
}
}
result
}
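For completeness, a minimal sketch (my addition, not from the answer above) of the most direct fix, keeping the original Map[Int, Option[List[Int]]] value type and matching on the nested Option inside tuplesWithRestrictions1:

// Direct fix for the original match: handle the nested Option explicitly.
// r, v1 and x refer to the same values as in the question's code.
val m: Map[Int, Option[List[Int]]] = r._2.get(v1) match {
  case Some(Some(xs)) => r._2.updated(v1, Some(x :: xs)) // key present, list present
  case Some(None)     => r._2.updated(v1, Some(List(x))) // key present, empty value
  case None           => r._2 + (v1 -> Some(List(x)))    // key absent
}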

Thoughts about a collection method that splits multiple times given a predicate

I am looking for a collections method which splits at a given pairwise condition, e.g.
val x = List("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 4, "c" -> 5)
implicit class RichIterableLike[A, CC[~] <: Iterable[~]](it: CC[A]) {
def groupWith(fun: (A, A) => Boolean): Iterator[CC[A]] = new Iterator[CC[A]] {
def hasNext: Boolean = ???
def next(): CC[A] = ???
}
}
assert(x.groupWith(_._1 != _._1).toList ==
List(List("a" -> 1, "a" -> 2), List("b" -> 3), List("c" -> 4, "c" -> 5))
)
So this is sort of a recursive span.
While I'm capable of implementing the ???, I wonder whether something already exists in the collections that I'm overlooking, and what that method should be called; groupWith doesn't sound right. It should be concise, but somehow reflect that the function argument operates on pairs. groupWhere would be a bit closer, I guess, but still not clear.
Actually, I guess that when using groupWith, the predicate logic should be inverted, so I would use x.groupWith(_._1 == _._1).
Some thoughts about the types: returning an Iterator[CC[A]] looks reasonable to me. Perhaps it should take a CanBuildFrom and return an Iterator[To]?
You can also write a version that uses tailrec/pattern matching:
def groupWith[A](s: Seq[A])(p: (A, A) => Boolean): Seq[Seq[A]] = {
@tailrec
def rec(xs: Seq[A], acc: Seq[Seq[A]] = Vector.empty): Seq[Seq[A]] = {
(xs.headOption, acc.lastOption) match {
case (None, _) => acc
case (Some(a), None) => rec(xs.tail, acc :+ Vector(a))
case (Some(a), Some(group)) if p(a, group.last) => rec(xs.tail, acc.init :+ (acc.last :+ a))
case (Some(a), Some(_)) => rec(xs.tail, acc :+ Vector(a))
}
}
rec(s)
}
So here is my suggestion. I stuck with groupWith, because spans is not very descriptive in my opinion. It is true that groupBy has very different semantics; however, there is grouped(size: Int), which is similar.
I tried to create my iterator purely by combining existing iterators, but this got messy, so here is the more low-level version:
import scala.collection.generic.CanBuildFrom
import scala.annotation.tailrec
import language.higherKinds
object Extensions {
private final class GroupWithIterator[A, CC[~] <: Iterable[~], To](
it: CC[A], p: (A, A) => Boolean)(implicit cbf: CanBuildFrom[CC[A], A, To])
extends Iterator[To] {
private val peer = it.iterator
private var consumed = true
private var elem = null.asInstanceOf[A]
def hasNext: Boolean = !consumed || peer.hasNext
private def pop(): A = {
if (!consumed) return elem
if (!peer.hasNext)
throw new NoSuchElementException("next on empty iterator")
val res = peer.next()
elem = res
consumed = false
res
}
 
def next(): To = {
val b = cbf()
@tailrec def loop(pred: A): Unit = {
b += pred
consumed = true
if (!peer.isEmpty) {
val succ = pop()
if (p(pred, succ)) loop(succ)
}
}
loop(pop())
b.result()
}
}
 
implicit final class RichIterableLike[A, CC[~] <: Iterable[~]](val it: CC[A])
extends AnyVal {
/** Clumps the collection into groups based on a predicate which determines
* if successive elements belong to the same group.
*
* For example:
* {{{
* val x = List("a", "a", "b", "a", "b", "b")
* x.groupWith(_ == _).to[Vector]
* }}}
*
* produces `Vector(List("a", "a"), List("b"), List("a"), List("b", "b"))`.
*
* @param p a function which is evaluated with successive pairs of
* the input collection. As long as the predicate holds
* (the function returns `true`), elements are lumped together.
* When the predicate becomes `false`, a new group is started.
*
* @param cbf a builder factory for the group type
* @tparam To the group type
* @return an iterator over the groups.
*/
def groupWith[To](p: (A, A) => Boolean)
(implicit cbf: CanBuildFrom[CC[A], A, To]): Iterator[To] =
new GroupWithIterator(it, p)
}
}
That is, the predicate is inverted as opposed to the question.
import Extensions._
val x = List("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 4, "c" -> 5)
x.groupWith(_._1 == _._1).to[Vector]
// -> Vector(List((a,1), (a,2)), List((b,3)), List((c,4), (c,5)))
You could achieve it with a fold too, right? Here is an unoptimized version:
def groupWith[A](ls: List[A])(p: (A, A) => Boolean): List[List[A]] =
ls.foldLeft(List[List[A]]()) { (acc, x) =>
if(acc.isEmpty)
List(List(x))
else
if(p(acc.last.head, x))
acc.init ++ List(acc.last ++ List(x))
else
acc ++ List(List(x))
}
val x = List("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 4, "c" -> 5, "a" -> 4)
println(groupWith(x)(_._1 == _._1))
//List(List((a,1), (a,2)), List((b,3)), List((c,4), (c,5)), List((a,4)))

Lazy Cartesian product of several Seqs in Scala

I implemented a simple method to generate the Cartesian product of several Seqs, like this:
object RichSeq {
implicit def toRichSeq[T](s: Seq[T]) = new RichSeq[T](s)
}
class RichSeq[T](s: Seq[T]) {
import RichSeq._
def cartesian(ss: Seq[Seq[T]]): Seq[Seq[T]] = {
ss.toList match {
case Nil => Seq(s)
case s2 :: Nil => {
for (e <- s) yield s2.map(e2 => Seq(e, e2))
}.flatten
case s2 :: tail => {
for (e <- s) yield s2.cartesian(tail).map(seq => e +: seq)
}.flatten
}
}
}
Obviously, this one is really slow, as it calculates the whole product at once. Did anyone implement a lazy solution for this problem in Scala?
UPD
OK, So I implemented a reeeeally stupid, but working version of an iterator over a Cartesian product. Posting here for future enthusiasts:
import scala.collection.mutable.ArrayBuffer

object RichSeq {
implicit def toRichSeq[T](s: Seq[T]) = new RichSeq(s)
}
class RichSeq[T](s: Seq[T]) {
def lazyCartesian(ss: Seq[Seq[T]]): Iterator[Seq[T]] = new Iterator[Seq[T]] {
private[this] val seqs = s +: ss
private[this] var indexes = Array.fill(seqs.length)(0)
private[this] val counts = Vector(seqs.map(_.length - 1): _*)
private[this] var current = 0
def next(): Seq[T] = {
val buffer = ArrayBuffer.empty[T]
if (current != 0) {
throw new NoSuchElementException("no more elements to traverse")
}
val newIndexes = ArrayBuffer.empty[Int]
var inside = 0
for ((index, i) <- indexes.zipWithIndex) {
buffer.append(seqs(i)(index))
newIndexes.append(index)
if ((0 to i).forall(ind => newIndexes(ind) == counts(ind))) {
inside = inside + 1
}
}
current = inside
if (current < seqs.length) {
for (i <- (0 to current).reverse) {
if ((0 to i).forall(ind => newIndexes(ind) == counts(ind))) {
newIndexes(i) = 0
} else if (newIndexes(i) < counts(i)) {
newIndexes(i) = newIndexes(i) + 1
}
}
current = 0
indexes = newIndexes.toArray
}
buffer.result()
}
def hasNext: Boolean = current != seqs.length
}
}
Here's my solution to the given problem. Note that the laziness is simply achieved by using .view on the "root collection" of the for comprehension.
scala> def combine[A](xs: Traversable[Traversable[A]]): Seq[Seq[A]] =
| xs.foldLeft(Seq(Seq.empty[A])){
| (x, y) => for (a <- x.view; b <- y) yield a :+ b }
combine: [A](xs: Traversable[Traversable[A]])Seq[Seq[A]]
scala> combine(Set(Set("a","b","c"), Set("1","2"), Set("S","T"))) foreach (println(_))
List(a, 1, S)
List(a, 1, T)
List(a, 2, S)
List(a, 2, T)
List(b, 1, S)
List(b, 1, T)
List(b, 2, S)
List(b, 2, T)
List(c, 1, S)
List(c, 1, T)
List(c, 2, S)
List(c, 2, T)
To obtain this, I started from the function combine defined in https://stackoverflow.com/a/4515071/53974, passing it the function (a, b) => (a, b). However, that didn't quite work directly, since that code expects a function of type (A, A) => A. So I just adapted the code a bit.
These might be a starting point:
Cartesian product of two lists
Expand a Set[Set[String]] into Cartesian Product in Scala
https://stackoverflow.com/questions/6182126/im-learning-scala-would-it-be-possible-to-get-a-little-code-review-and-mentori
What about:
def cartesian[A](list: List[Seq[A]]): Iterator[Seq[A]] = {
if (list.isEmpty) {
Iterator(Seq())
} else {
list.head.iterator.flatMap { i => cartesian(list.tail).map(i +: _) }
}
}
Simple and lazy ;)
def cartesian[A](list: List[List[A]]): List[List[A]] = {
list match {
case Nil => List(List())
case h :: t => h.flatMap(i => cartesian(t).map(i :: _))
}
}
You can look here: https://stackoverflow.com/a/8318364/312172 to see how to translate a number into an index of all possible values, without generating every element.
This technique can be used to implement a stream.
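As an illustration of that technique (my own sketch, not code from the linked answer): element k of the product can be obtained by interpreting k as a mixed-radix number whose digits index into the individual sequences, and enumerating the indices then gives a lazy iterator without materializing the whole product.

// Interpret k as a mixed-radix number: each "digit" selects one element
// from the corresponding sequence (the rightmost sequence varies fastest).
def nthProduct[A](seqs: Seq[Seq[A]])(k: Long): Seq[A] =
  seqs.foldRight((List.empty[A], k)) { case (s, (acc, rest)) =>
    (s((rest % s.size).toInt) :: acc, rest / s.size)
  }._1

// Lazily enumerate the whole product by walking the indices 0 until N.
def lazyProduct[A](seqs: Seq[Seq[A]]): Iterator[Seq[A]] = {
  val total = seqs.map(_.size.toLong).product
  (0L until total).iterator.map(nthProduct(seqs))
}

// Example: lazyProduct(List(Seq(1, 2), Seq(3, 4, 5))).foreach(println)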