I've read many questions about Scala collections and still could not decide which one to use.
This question was the most useful so far.
I think the core of the question is twofold:
1) Which are the best collections for this use case?
2) Which are the recommended ways to use them?
I am implementing an algorithm that iterates over all elements in a collection
searching for the one that matches a certain criterion.
After the search, the next step is to search again with a new criterion, but without the chosen element among the possibilities.
The idea is to create a sequence with all original elements ordered by the criterion (which changes at every new selection).
The original sequence doesn't really need to be ordered, but there can be duplicates (the algorithm will only pick one at a time).
Example with a small sequence of Ints (just to simplify):
import scala.util.Random

object Foo extends App {
  def f(already_selected: Seq[Int])(element: Int): Double =
    // something more complex happens here,
    // especially something that takes 'already_selected' into account
    ??? // placeholder: the real scoring function is elided

  // call to the algorithm
  val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
  println("ti = " + ti)

  def recur(collection: Seq[Int], already_selected: Seq[Int]): (Seq[Int], Seq[Int]) =
    if (collection.isEmpty) (Seq(), already_selected)
    else {
      val selected = collection maxBy f(already_selected)
      val rest = collection diff Seq(selected) // this part doesn't seem to be efficient
      recur(rest, selected +: already_selected)
    }
}

object Tempo {
  // runs f once and returns its result together with the elapsed time in seconds
  def time[T](f: => T): (T, Double) = {
    val s = System.currentTimeMillis
    (f, (System.currentTimeMillis - s) / 1000d)
  }
}

Try @inline and, as icn suggested, the approach from "How can I idiomatically "remove" a single element from a list in Scala and close the gap?":
import scala.util.Random

object Foo extends App {
  def f(already_selected: Seq[Int])(element: Int): Double =
    // something more complex happens here,
    // especially something that takes 'already_selected' into account
    ??? // placeholder: the real scoring function is elided

  // call to the algorithm
  val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
  println("ti = " + ti)

  def recur(collection: Seq[Int], already_selected: Seq[Int]): Seq[Int] =
    if (collection.isEmpty) already_selected
    else {
      // index the elements on every call, so that i is always a position in the current collection
      val (selected, i) = collection.zipWithIndex.maxBy { case (x, _) => f(already_selected)(x) }
      val rest = collection.patch(i, Nil, 1) // drops only the element at position i
      recur(rest, selected +: already_selected)
    }
}

object Tempo {
  // runs f once and returns its result together with the elapsed time in seconds
  def time[T](f: => T): (T, Double) = {
    val s = System.currentTimeMillis
    (f, (System.currentTimeMillis - s) / 1000d)
  }
}
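
Since the question says the original sequence does not have to stay ordered, another option is a mutable buffer where the chosen element is overwritten by the last one and the last slot is dropped, which makes each removal constant time. This is only a sketch, not a benchmarked recommendation: SwapRemove, score and selectAll are hypothetical names, and score is a made-up stand-in for the elided function f.

```scala
import scala.collection.mutable.ArrayBuffer

object SwapRemove {
  // Made-up scoring function (an assumption, standing in for the elided f):
  // with nothing selected yet it prefers large values, afterwards values far from the last pick.
  def score(selected: List[Int])(x: Int): Double =
    selected.headOption.fold(x.toDouble)(last => math.abs(x.toLong - last).toDouble)

  def selectAll(input: Seq[Int]): List[Int] = {
    val pool = ArrayBuffer.empty[Int]
    pool ++= input
    var selected = List.empty[Int]
    while (pool.nonEmpty) {
      // one linear scan for the index of the best remaining element
      var best = 0
      var bestScore = score(selected)(pool(0))
      var i = 1
      while (i < pool.length) {
        val s = score(selected)(pool(i))
        if (s > bestScore) { best = i; bestScore = s }
        i += 1
      }
      selected = pool(best) :: selected
      // constant-time removal: move the last element into the freed slot, then drop the last slot
      pool(best) = pool(pool.length - 1)
      pool.remove(pool.length - 1)
    }
    selected // most recently selected element first
  }
}
```

Each round still needs a full scan because the criterion changes after every selection, so the overall cost stays quadratic; the sketch only removes the extra work of diff or patch.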


request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO with a Map that also is in an IO monad. I'm using http4s with blaze server and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// get the shifts
var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)
// use the userRoleId to get the RoleId then get the tasks for this role
val taskMap : IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
case tskLst: List[Task] => IO(tskLst.map((task: Task) => (task.name -> task.standard)).toMap)
val traversed: IO[List[Shift]] = for {
shifts <- getDbShifts
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) =>
taskMap.flatMap((tm: Map[String, Double]) =>
IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get)))
//TODO: this flatMap is bricking my request
lstShiftJson.flatMap((sjLst: List[ShiftJson]) => {
IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
} yield traversed
traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
As the TODO comment shows, I've narrowed the problem down to the flatMap below it. If I remove that flatMap and merely return "IO(shift)" to the traversed variable, the request does not time out. However, that doesn't help me much, because I need the lstShiftJson variable, which holds my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
So, with the guidance of Luis's comment, I refactored my code to the following. I don't think it is optimal (i.e. the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// - read the shift.roleTasks into a ShiftJson object
// - divide each task value by the task.standard where task.name = shiftJson.name
// - write the list of shiftJson back to a string
val traversed = for {
taskMap <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => (task.name -> task.standard)).toMap)
shifts <- shiftModel.findByUserId(userId)
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get ))
shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
} yield traversed
traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He also mentioned that I was wrapping every operation that comes from the database in IO, which caused a great deal of recomputation (including DB transactions).
To solve this issue, I moved all of the database operations inside my for-comprehension. Using the "<-" operator to flatMap each of the returned values lets the bound variables reside within the IO monad, which prevents the recomputation experienced before.
I do think there must be a better way of returning my result. flatMapping the "traversed" variable to get back inside the IO monad seems to be unnecessary recomputation, so please, anyone, correct me.
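
To make the recomputation point concrete, here is a small sketch. It deliberately does not use cats-effect: IO below is a toy stand-in written for this example (an assumption, not the library's API), and taskMap, perElement and boundOnce are hypothetical names. It only illustrates that an IO value is a description which runs again every time it is referenced inside another description.

```scala
object IoRecompute {
  // Toy stand-in for an IO type (NOT cats.effect.IO): a description of a computation
  // that is executed again on every unsafeRun().
  final class IO[A](thunk: () => A) {
    def unsafeRun(): A = thunk()
    def map[B](f: A => B): IO[B] = new IO(() => f(thunk()))
    def flatMap[B](f: A => IO[B]): IO[B] = new IO(() => f(thunk()).unsafeRun())
  }
  object IO {
    def apply[A](a: => A): IO[A] = new IO(() => a)
  }

  var queries = 0 // counts how often the pretend database is hit

  // Hypothetical replacement for the task lookup; every run counts as one query.
  val taskMap: IO[Map[String, Double]] =
    IO { queries += 1; Map("pick" -> 2.0, "pack" -> 4.0) }

  // Refers to the taskMap description once per element, so the query runs once per element.
  def perElement(values: List[(String, Double)]): IO[List[Double]] =
    values
      .map { case (name, v) => taskMap.flatMap(tm => IO(v / tm(name))) }
      .foldRight(IO(List.empty[Double]))((io, acc) => io.flatMap(x => acc.map(x :: _)))

  // Binds the map once with <- and then works with plain values.
  def boundOnce(values: List[(String, Double)]): IO[List[Double]] =
    for (tm <- taskMap) yield values.map { case (name, v) => v / tm(name) }
}
```

With three input values, running perElement bumps the counter three times, while boundOnce bumps it once; both produce the same list.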

Scala Enum.values.map(_.id).contains(value) cost much time

I want to check whether a specific id is contained in an Enumeration,
so I wrote this contains function:
object Enum extends Enumeration {
  type Enum = Value
  val A = Value(2, "A")
  def contains(value: Int): Boolean = {
    Enum.values.map(_.id).contains(value)
  }
}
But the time cost is unexpectedly high when the id is a big number, such as one with more than eight digits:
val A = Value(222222222, "A")
Then each call to the contains function takes over 1000 ms.
I also noticed that the first call always takes hundreds of milliseconds, whether the id is big or small.
I can't figure out why.
First, let's talk about the cost of Enum.values. It is implemented here:
https://github.com/scala/scala/blob/0b47dc2f28c997aed86d6f615da00f48913dd46c/src/library/scala/Enumeration.scala#L83
The implementation essentially sets up a mutable map. Once it is set up, it is reused.
The cost for big numbers in your Value arises because the Scala library internally uses a BitSet:
https://github.com/scala/scala/blob/0b47dc2f28c997aed86d6f615da00f48913dd46c/src/library/scala/Enumeration.scala#L245
So, for larger numbers, the BitSet will be bigger. That only happens when you call Enum.values.
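
A rough back-of-the-envelope sketch of the sizes involved. It assumes the set stores one bit per possible id, packed into 64-bit words, which is how a bit set is usually laid out; BitSetCost, wordsNeeded and bytesNeeded are hypothetical names used only here.

```scala
object BitSetCost {
  // Number of 64-bit words needed so that bit number maxId exists.
  def wordsNeeded(maxId: Int): Int = (maxId >> 6) + 1

  // Approximate payload size in bytes (8 bytes per word, ignoring object headers).
  def bytesNeeded(maxId: Int): Long = wordsNeeded(maxId).toLong * 8L
}
```

For an id of 2 this gives a single word. For 222222222 it gives 3472223 words, i.e. roughly 28 MB, which is presumably why both building the set and iterating over it become slow.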
Depending on your specific use case, you can choose between using Enumeration or case objects:
Case objects vs Enumerations in Scala
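
As a minimal sketch of the case-object alternative (Codes, Code, A, B and all are hypothetical names, not something from the question), the ids can live in an ordinary hash set, so large ids cost no more than small ones:

```scala
object Codes {
  sealed abstract class Code(val id: Int, val name: String)
  case object A extends Code(2, "A")
  case object B extends Code(222222222, "B")

  // The list of members has to be maintained by hand in this sketch.
  val all: List[Code] = List(A, B)

  // Built once; lookup cost does not depend on how large the ids are.
  private val ids: Set[Int] = all.map(_.id).toSet

  def contains(id: Int): Boolean = ids.contains(id)
}
```

The trade-off is that the compiler no longer collects the values for you, which Enumeration does through Value.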
It sure looks like the mechanics of Enumeration don't handle large ints well in that position. The Scaladocs for the class don't say anything about this, but they don't advertise using Enumeration.Value the way you do either. They say, e.g., val A = Value, where you say val A = Value(2000, "A").
If you want to keep your contains method as you have it, why don't you cache the Enum.values.map(_.id)? Much faster.
object mult extends App {
  object Enum extends Enumeration {
    type Enum = Value
    val A1 = Value(1, "A")
    val A2 = Value(2, "A")
    val A222 = Enum.Value(222222222, "A")
    // recomputes the id set on every call
    def contains(value: Int): Boolean = {
      Enum.values.map(_.id).contains(value)
    }
    // computes the id set once
    val cache = Enum.values.map(_.id)
    def contains2(value: Int): Boolean = {
      cache.contains(value)
    }
  }
  // evaluates f once and prints the elapsed milliseconds
  def clockit(desc: String, f: => Unit) = {
    val start = System.currentTimeMillis
    f
    val end = System.currentTimeMillis
    println(s"$desc ${end - start}")
  }
  clockit("initialize Enum ", Enum.A1)
  clockit("contains 2 ", Enum.contains(2))
  clockit("contains 222222222 ", Enum.contains(222222222))
  clockit("contains 222222222 ", Enum.contains(222222222))
  clockit("contains2 2 ", Enum.contains2(2))
  clockit("contains2 222222222", Enum.contains2(222222222))
}

Parallel collection processing of data larger than memory size

Is there a simple way to use scala parallel collections without loading a full collection into memory?
For example, I have a large collection and I'd like to perform a particular operation (fold) in parallel only on a small chunk that fits into memory, then on another chunk and so on, and finally recombine the results from all chunks.
I know that actors could be used, but it would be really nice to use par-collections.
I've written a solution, but it isn't nice:
def split[A](list: Iterable[A], chunkSize: Int): Iterable[Iterable[A]] = {
  new Iterator[Iterable[A]] {
    var rest = list
    def hasNext = !rest.isEmpty
    def next = {
      val chunk = rest.take(chunkSize)
      rest = rest.drop(chunkSize)
      chunk
    }
  }.toIterable
}

def foldPar[A](acc: A)(list: Iterable[A], chunkSize: Int, combine: ((A, A) => A)): A = {
  val chunks: Iterable[Iterable[A]] = split(list, chunkSize)
  def combineChunk: ((A,Iterable[A]) => A) = { case (res, entries) => entries.par.fold(res)(combine) }
  chunks.foldLeft(acc)(combineChunk)
}

val chunkSize = 10000000
val x = 1 to chunkSize*10
def sum: ((Int,Int) => Int) = {case (acc,n) => acc + n }
foldPar(0)(x, chunkSize, sum)
Your idea is very neat and it's a pity there is no such function available already (AFAIK).
I just rephrased your idea into a bit shorter code. First, I feel that for parallel folding it's useful to use the concept of a monoid - it's a structure with an associative operation and a zero element. The associativity is important because we don't know the order in which we combine the results that are computed in parallel. And the zero element is important so that we can split computations into blocks and start folding each one from the zero. There is nothing new about it though, it's just what fold for Scala's collections expects.
// The function defined by Monoid's apply must be associative
// and zero its identity element.
trait Monoid[A]
  extends Function2[A,A,A]
{
  val zero: A
}
Next, Scala's Iterators already have a useful method grouped(Int): GroupedIterator[Seq[A]] which slices the iterator into fixed-size sequences. It's quite similar to your split. This allows us to cut the input into fixed-size blocks and then apply Scala's parallel collection methods on them:
def parFold[A](c: Iterator[A], blockSize: Int)(implicit monoid: Monoid[A]): A =
  c.grouped(blockSize).map(_.par.fold(monoid.zero)(monoid)).fold(monoid.zero)(monoid)
We fold each block using the parallel collections framework and then (without any parallelization) combine the intermediate results.
An example:
// Example:
object SumMonoid extends Monoid[Long] {
  override val zero: Long = 0;
  override def apply(x: Long, y: Long) = x + y;
}
val it = Iterator.range(1, 10000001).map(_.toLong)
println(parFold(it, 100000)(SumMonoid));
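
Since Scala 2.13 the parallel collections live in a separate module, so .par may not be available out of the box. The following sketch shows the same block-wise idea with plain Futures from the standard library instead. It is a swapped-in technique, not the code above; BlockFold and the parts parameter are assumptions made for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object BlockFold {
  // Folds an iterator block by block; only one block is held in memory at a time.
  // zero must be the identity of op, and op must be associative.
  def parFold[A](it: Iterator[A], blockSize: Int, parts: Int)(zero: A)(op: (A, A) => A): A =
    it.grouped(blockSize).map { block =>
      // cut the in-memory block into at most `parts` slices and fold each slice on the thread pool
      val sliceSize = math.max(1, (block.size + parts - 1) / parts)
      val partials = block.grouped(sliceSize).map(slice => Future(slice.fold(zero)(op))).toList
      // wait for this block before reading the next one, then combine its partial results
      Await.result(Future.sequence(partials), Duration.Inf).fold(zero)(op)
    }.fold(zero)(op)
}
```

For example, BlockFold.parFold(Iterator.range(1, 100001).map(_.toLong), 1000, 4)(0L)(_ + _) sums the numbers from 1 to 100000 in blocks of 1000 elements.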

Encoding recursive tree-creation with while loop + stacks

I'm a bit embarrassed to admit this, but I seem to be pretty stumped by what should be a simple programming problem. I'm building a decision tree implementation, and have been using recursion to take a list of labeled samples, recursively split the list in half, and turn it into a tree.
Unfortunately, with deep trees I run into stack overflow errors (ha!), so my first thought was to use continuations to turn it into tail recursion. Unfortunately Scala doesn't support that kind of TCO, so the only solution is to use a trampoline. A trampoline seems kinda inefficient and I was hoping there would be some simple stack-based imperative solution to this problem, but I'm having a lot of trouble finding it.
The recursive version looks sort of like (simplified):
private def trainTree(samples: Seq[Sample], usedFeatures: Set[Int]): DTree = {
  if (shouldStop(samples)) {
    DTLeaf(makeProportions(samples))
  } else {
    val featureIdx = getSplittingFeature(samples, usedFeatures)
    val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
    DTBranch(
      trainTree(statsWithFeature, usedFeatures + featureIdx),
      trainTree(statsWithoutFeature, usedFeatures + featureIdx),
      featureIdx)
  }
}
So basically I'm recursively subdividing the list into two according to some feature of the data, and passing through a list of used features so I don't repeat - that's all handled in the "getSplittingFeature" function so we can ignore it. The code is really simple! Still, I'm having trouble figuring out a stack-based solution that doesn't just use closures and effectively become a trampoline. I know we'll at least have to keep around little "frames" of arguments in the stack but I would like to avoid closure calls.
I get that I should be writing out explicitly what the call stack and program counter handle for me implicitly in the recursive solution, but I'm having trouble doing that without continuations. At this point it's hardly even about efficiency, I'm just curious. So please, no need to remind me that premature optimization is the root of all evil and the trampoline-based solution will probably work just fine. I know it probably will - this is basically a puzzle for its own sake.
Can anyone tell me what the canonical while-loop-and-stack-based solution to this sort of thing is?
UPDATE: Based on Thipor Kong's excellent solution, I've coded up a while-loops/stacks/hashtable based implementation of the algorithm that should be a direct translation of the recursive version. This is exactly what I was looking for:
FINAL UPDATE: I've used sequential integer indices, as well as putting everything back into arrays instead of maps for performance, added maxDepth support, and finally have a solution with the same performance as the recursive version (not sure about memory usage but I would guess less):
private def trainTreeNoMaxDepth(startingSamples: Seq[Sample], startingMaxDepth: Int): DTree = {
  // Use arraybuffer as dense mutable int-indexed map - no IndexOutOfBoundsException, just expand to fit
  type DenseIntMap[T] = ArrayBuffer[T]
  def updateIntMap[@specialized T](ab: DenseIntMap[T], idx: Int, item: T, dfault: T = null.asInstanceOf[T]) = {
    if (ab.length <= idx) {ab.insertAll(ab.length, Iterable.fill(idx - ab.length + 1)(dfault)) }
    ab.update(idx, item)
  }
  var currentChildId = 0
  // get childIdx or create one if it's not there already
  def child(childMap: DenseIntMap[Int], heapIdx: Int) =
    if (childMap.length > heapIdx && childMap(heapIdx) != -1) childMap(heapIdx)
    else {currentChildId += 1; updateIntMap(childMap, heapIdx, currentChildId, -1); currentChildId }
  // go down
  val leftChildren, rightChildren = new DenseIntMap[Int]() // heapIdx -> childHeapIdx
  val todo = Stack((startingSamples, Set.empty[Int], startingMaxDepth, 0)) // samples, usedFeatures, maxDepth, heapIdx
  val branches = new Stack[(Int, Int)]() // heapIdx, featureIdx
  val nodes = new DenseIntMap[DTree]() // heapIdx -> node
  while (!todo.isEmpty) {
    val (samples, usedFeatures, maxDepth, heapIdx) = todo.pop()
    if (shouldStop(samples) || maxDepth == 0) {
      updateIntMap(nodes, heapIdx, DTLeaf(makeProportions(samples)))
    } else {
      val featureIdx = getSplittingFeature(samples, usedFeatures)
      val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
      todo.push((statsWithFeature, usedFeatures + featureIdx, maxDepth - 1, child(leftChildren, heapIdx)))
      todo.push((statsWithoutFeature, usedFeatures + featureIdx, maxDepth - 1, child(rightChildren, heapIdx)))
      branches.push((heapIdx, featureIdx))
    }
  }
  // go up
  while (!branches.isEmpty) {
    val (heapIdx, featureIdx) = branches.pop()
    updateIntMap(nodes, heapIdx, DTBranch(nodes(child(leftChildren, heapIdx)), nodes(child(rightChildren, heapIdx)), featureIdx))
  }
  // the root was registered under heapIdx 0
  nodes(0)
}
Just store the binary tree in an array, as described on Wikipedia: for node i, the left child goes into 2*i+1 and the right child into 2*i+2. While going "down", you keep a collection of todos that still have to be split to reach a leaf. Once you've got only leaves, you go upward (from right to left in the array) to build the decision nodes:
Update: A cleaned-up version that also supports the features stored in the branches (type parameter B), that is more functional/fully pure, and that supports sparse trees with a map, as suggested by ron.
Update2-3: Make economical use of the name space for node ids and abstract over the type of ids to allow for large trees. Take node ids from a Stream.
sealed trait DTree[A, B]
case class DTLeaf[A, B](a: A, b: B) extends DTree[A, B]
case class DTBranch[A, B](left: DTree[A, B], right: DTree[A, B], b: B) extends DTree[A, B]

def mktree[A, B, Id](a: A, b: B, split: (A, B) => Option[(A, A, B)], ids: Stream[Id]) = {
  // first pass: split until only leaves are left, remembering every branch on the way
  def goDown(todo: Seq[(A, B, Id)], branches: Seq[(Id, B, Id, Id)], leafs: Map[Id, DTree[A, B]], ids: Stream[Id]): (Seq[(Id, B, Id, Id)], Map[Id, DTree[A, B]]) =
    todo match {
      case Nil => (branches, leafs)
      case (a, b, id) :: rest =>
        split(a, b) match {
          case None =>
            goDown(rest, branches, leafs + (id -> DTLeaf(a, b)), ids)
          case Some((left, right, b2)) =>
            val leftId #:: rightId #:: idRest = ids
            goDown((right, b2, rightId) +: (left, b2, leftId) +: rest, (id, b2, leftId, rightId) +: branches, leafs, idRest)
        }
    }

  // second pass: build the branches bottom-up from the already finished children
  def goUp[A, B](branches: Seq[(Id, B, Id, Id)], nodes: Map[Id, DTree[A, B]]): Map[Id, DTree[A, B]] =
    branches match {
      case Nil => nodes
      case (id, b, leftId, rightId) :: rest =>
        goUp(rest, nodes + (id -> DTBranch(nodes(leftId), nodes(rightId), b)))
    }

  val rootId #:: restIds = ids
  val (branches, leafs) = goDown(Seq((a, b, rootId)), Seq(), Map(), restIds)
  goUp(branches, leafs)(rootId)
}

// try it out
def split(xs: Seq[Int], b: Int) =
  if (xs.size > 1) {
    val (left, right) = xs.splitAt(xs.size / 2)
    Some((left, right, b + 1))
  } else {
    None
  }

val tree = mktree(0 to 1000, 0, split _, Stream.from(0))
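
For comparison, here is a compact imperative sketch of the same two-pass idea, using an explicit stack and the 2*i+1 / 2*i+2 numbering. It is an illustration under simplifying assumptions, not the decision tree code from the question: StackTree, Leaf, Branch, build and leaves are hypothetical names, and the split rule simply halves the input until at most one element is left.

```scala
import scala.collection.mutable

object StackTree {
  sealed trait Tree
  final case class Leaf(values: Seq[Int]) extends Tree
  final case class Branch(left: Tree, right: Tree) extends Tree

  def build(root: Seq[Int]): Tree = {
    val todo = mutable.ArrayBuffer[(Seq[Int], Long)]((root, 0L)) // used as a stack: (data, heap index)
    val branches = mutable.ArrayBuffer.empty[Long]                 // heap indices of inner nodes
    val nodes = mutable.Map.empty[Long, Tree]                      // heap index -> finished node

    // go down: split until only leaves are left
    while (todo.nonEmpty) {
      val (xs, i) = todo.remove(todo.length - 1)
      if (xs.size <= 1) nodes(i) = Leaf(xs)
      else {
        val (l, r) = xs.splitAt(xs.size / 2)
        todo += ((l, 2 * i + 1))
        todo += ((r, 2 * i + 2))
        branches += i
      }
    }

    // go up: a parent is always recorded before its descendants,
    // so walking the list backwards finishes children first
    var k = branches.length - 1
    while (k >= 0) {
      val i = branches(k)
      nodes(i) = Branch(nodes(2 * i + 1), nodes(2 * i + 2))
      k -= 1
    }
    nodes(0L)
  }

  // Recursive helper, only for inspecting small results.
  def leaves(t: Tree): List[Int] = t match {
    case Leaf(xs)     => xs.toList
    case Branch(l, r) => leaves(l) ::: leaves(r)
  }
}
```

The Long heap index doubles at every level, so this numbering only works for trees of limited depth; for deep, unbalanced trees, sequential ids as in the updates above are the safer choice.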

Easiest way to decide if List contains duplicates?

One way is this
list.distinct.size != list.size
Is there any better way? It would have been nice to have a containsDuplicates method
Assuming "better" means "faster", see the alternative approaches benchmarked in this question, which seems to show some quicker methods (although note that distinct uses a HashSet and is already O(n)). YMMV of course, depending on specific test case, scala version etc. Probably any significant improvement over the "distinct.size" approach would come from an early-out as soon as a duplicate is found, but how much of a speed-up is actually obtained would depend strongly on how common duplicates actually are in your use-case.
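
A minimal sketch of such an early-out, assuming a mutable hash set is acceptable (EarlyExit is a hypothetical name): add returns false when the element is already present, and exists stops at the first hit.

```scala
import scala.collection.mutable

object EarlyExit {
  def containsDuplicates[A](xs: Iterable[A]): Boolean = {
    val seen = mutable.HashSet.empty[A]
    // seen.add(x) is false exactly when x was seen before; exists stops there
    xs.exists(x => !seen.add(x))
  }
}
```

For a list without duplicates this still visits every element, so the gain depends on how early the first duplicate shows up.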
If you mean "better" in that you want to write list.containsDuplicates instead of containsDuplicates(list), use an implicit:
implicit def enhanceWithContainsDuplicates[T](s:List[T]) = new {
  def containsDuplicates = (s.distinct.size != s.size)
}
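
On Scala 2.10 and later, the same enrichment can be written as an implicit class, which avoids the reflective call that the anonymous `new { ... }` involves. A sketch, with DuplicateSyntax and ContainsDuplicatesOps as hypothetical names:

```scala
object DuplicateSyntax {
  // Value class wrapper: adds containsDuplicates to every List[T] once the object is imported.
  implicit class ContainsDuplicatesOps[T](private val s: List[T]) extends AnyVal {
    def containsDuplicates: Boolean = s.distinct.size != s.size
  }
}
```

After `import DuplicateSyntax._`, `List(1, 2, 2).containsDuplicates` compiles and yields true.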
You can also write:
list.toSet.size != list.size
But the result will be the same because distinct is already implemented with a Set. In both cases the time complexity should be O(n): you must traverse the list, and insertion into a hash set is effectively O(1).
I think this would stop as soon as a duplicate was found and is probably more efficient than doing distinct.size - since I assume distinct keeps a set as well:
def containsDups[A](list: List[A], seen: Set[A] = Set[A]()): Boolean =
  list match {
    case x :: xs => if (seen.contains(x)) true else containsDups(xs, seen + x)
    case _ => false
  }
// Boolean = true
// Boolean = false
I realize you asked for easy and I don't know that this version is, but finding a duplicate is also finding whether there is an element that has been seen before:
def containsDups[A](list: List[A]): Boolean = {
  list.iterator.scanLeft(Set[A]())((set, a) => set + a) // incremental sets
    .zip(list.iterator) // pair each element with the set of elements before it
    .exists{ case (set, a) => set contains a }
}
def containsDuplicates [T] (s: Seq[T]) : Boolean =
  if (s.size < 2) false else
    s.tail.contains (s.head) || containsDuplicates (s.tail)
I didn't measure this, and I think it is similar to huynhjl's solution, but a bit simpler to understand.
It returns early if a duplicate is found, so I looked into the source of Seq.contains to check whether that returns early as well - it does.
In SeqLike, 'contains (e)' is defined as 'exists (_ == e)', and exists is defined in TraversableLike:
def exists (p: A => Boolean): Boolean = {
  var result = false
  breakable {
    for (x <- this)
      if (p (x)) { result = true; break }
  }
  result
}
I'm curious how to speed things up with parallel collections on multiple cores, but I guess early returning is a general problem there: another thread will keep running, because it doesn't know that the solution has already been found.
I've written a very efficient function which returns both List.distinct and a List consisting of each element which appeared more than once and the index at which the element duplicate appeared.
Note: This answer is a straight copy of the answer on a related question.
If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and an Int index at which it occurred within the original List.
Here's the function:
def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(remaining: List[A], index: Int, accumulator: (List[A], List[(A, Int)])): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head))
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items, 0, (Nil, Nil))
  (distinct.reverse, dupes.reverse)
}
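
Note that accumulator._1.contains is a linear search on a List, so the function above does work proportional to the number of distinct elements for every item. If that matters, a hash set can take over the membership test. The following is a sketch of that variant, not the ScalaOlio implementation; the enclosing object Dupes is a hypothetical name.

```scala
import scala.collection.mutable

object Dupes {
  // Returns (distinct items in first-seen order, duplicates with the index where they occurred).
  def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
    val seen = mutable.HashSet.empty[A]
    val distinct = List.newBuilder[A]
    val dupes = List.newBuilder[(A, Int)]
    for ((item, index) <- items.zipWithIndex) {
      // add returns true only the first time an item is seen
      if (seen.add(item)) distinct += item
      else dupes += ((item, index))
    }
    (distinct.result(), dupes.result())
  }
}
```

It relies on the elements having sensible equals and hashCode implementations, which the List.contains version needs for equals as well.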
And below is an example which might make it a bit more intuitive. Given this List of String values:
val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")
...and then performing the following:
val (deduped, dupeAndIndexs) =
  filterDupes(withDupes)
...the results are:
deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))
And if you just want the duplicates, you simply map across dupeAndIndexs and invoke distinct:
val dupesOnly =
  dupeAndIndexs.map(_._1).distinct
...or all in a single call:
val dupesOnly =
  filterDupes(withDupes)._2.map(_._1).distinct
...or if a Set is preferred, skip distinct and invoke toSet...
val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet
...or all in a single call:
val dupesOnly2 =
  filterDupes(withDupes)._2.map(_._1).toSet
This is a straight copy of the filterDupes function out of my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.
If you're trying to check for duplicates in a test then ScalaTest can be helpful.
import org.scalatest.Inspectors._
import org.scalatest.Matchers._
forEvery(list.distinct) { item =>
  withClue(s"value $item, the number of occurences") {
    list.count(_ == item) shouldBe 1
  }
}
// example:
scala> val list = List(1,2,3,4,3,2)
list: List[Int] = List(1, 2, 3, 4, 3, 2)
scala> forEvery(list) { item => withClue(s"value $item, the number of occurences") { list.count(_ == item) shouldBe 1 } }
org.scalatest.exceptions.TestFailedException: forEvery failed, because:
at index 1, value 2, the number of occurences 2 was not equal to 1 (<console>:19),
at index 2, value 3, the number of occurences 2 was not equal to 1 (<console>:19)
in List(1, 2, 3, 4)