What is time complexity of slice in scala? - scala

What is time complexity of scala slice method? Is it O(m) or O(n),
where m is number of elements in slice and n is number of elements in collection.
More specific question: what is time complexity of
someMap.slice(i, i + 1).keys.head, where i is random int less than someMap.size? If slice complexity is O(m) then it should be O(1), right?

This clearly depends on the underlying datatype. Slicing an Array, ArrayBuffer, ByteBuffer, or any other sub-class of IndexedSeqOptimized for example is O(k) if you're slicing k elements from the container. List, for example, is O(n) you can see by its implementation. You'll likely want to check the source for your specific type.
Map gets its implementation of slice from IterableLike. From the implementation pasted below, it's clearly the cost to iterate over elements in the collection plus the cost to create a newly built map.
For TreeMap this should be O(n + k lg k) if you're slicing k elements out of the map.
For HashMap this should be bounded by O(n).
override /*TraversableLike*/ def slice(from: Int, until: Int): Repr = {
val lo = math.max(from, 0)
val elems = until - lo
val b = newBuilder
if (elems <= 0) b.result()
else {
b.sizeHintBounded(elems, this)
var i = 0
val it = iterator drop lo
while (i < elems && it.hasNext) {
b += it.next
i += 1
}
b.result()
}
}
When in doubt, look at the source.

Related

Scala: Sort a ListBuffer of (Int, Int, Int) by the third value most efficiently

I am looking to sort a ListBuffer(Int, Int, Int) by the third value most efficiently. This is what I currently have. I am using sortBy. Note y and z are both ListBuffer[(Int, Int, Int)], which I am taking the difference first. My goal is to optimize this and do this operation (taking the difference between the two lists and sorting by the third element) most efficiently. I am assuming the diff part cannot be optimized but the sortBy can, so I am looking for a efficient way to do to the sorting part. I found posts on sorting Arrays but I am working with ListBuffer and converting to an Array adds overhead, so I rather not to convert my ListBuffer.
val x = (y diff z).sortBy(i => i._3)
1) If you want to use Scala libraries then you can't do much better than that. Scala already tries to sort your collection in the most efficient way possible.
SeqLike defines def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f) which calls this implementation:
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
val len = this.length
val arr = new ArraySeq[A](len)
var i = 0
for (x <- this.seq) {
arr(i) = x
i += 1
}
java.util.Arrays.sort(arr.array, ord.asInstanceOf[Ordering[Object]])
val b = newBuilder
b.sizeHint(len)
for (x <- arr) b += x
b.result
}
This is what your code will be calling. As you can see it already uses arrays to sort data in place. According to the javadoc of public static void sort(Object[] a):
Implementation note: This implementation is a stable, adaptive,
iterative mergesort that requires far fewer than n lg(n) comparisons
when the input array is partially sorted, while offering the
performance of a traditional mergesort when the input array is
randomly ordered. If the input array is nearly sorted, the
implementation requires approximately n comparisons.
2) If you try to optimize by inserting results of your diff directly into a sorted structure like a binary tree as you produce them element by element, you'll still be paying the same price: average cost of insertion is log(n) times n elements = n log(n) - same as any fast sorting algorithm like merge sort.
3) Thus you can't optimize this case generically unless you optimize to your particular use-case.
3a) For instance, ListBuffer might be replaced with a Set and diff should be much faster. In fact it's implemented as:
def diff(that: GenSet[A]): This = this -- that
which uses - which in turn should be faster than diff on Seq which has to build a map first:
def diff[B >: A](that: GenSeq[B]): Repr = {
val occ = occCounts(that.seq)
val b = newBuilder
for (x <- this)
if (occ(x) == 0) b += x
else occ(x) -= 1
b.result
}
3b) You can also avoid sorting by using _3 as an index in an array. If you insert using that index your array will be sorted. This will only work if your data is dense enough or you are happy to deal with sparse array afterwards. One index might also have multiple values mapping to it, you'll have to deal with it as well. Effectively you are building a sorted map. You can use a Map for that as well, but a HashMap won't be sorted and a TreeMap will require log(n) for add operation again.
Consult Scala Collections Performance Characteristics to understand what you can gain based on your case.
4) Anyhow, sort is really fast on modern computers. Do some benchmarking to make sure you are not prematurely optimizing it.
To summarize complexity for different scenarios...
Your current case:
diff for SeqLike: n to create a map from that + n to iterate over this (map lookup is effectively constant time (C)) = 2n or O(n)
sort - O(n log(n))
total = O(n) + O(n log(n)) = O(n log(n)), more precisely: 2n + nlog(n)
If you use Set instead of SeqLike:
diff for Set: n to iterate (lookup is C) = O(n)
sort - same
total - same: O(n) + O(n log(n)) = O(n log(n)), more precisely: n + nlog(n)
If you use Set and array to insert:
diff - same as for Set
sort - 0 - array is sorted by construction
total: O(n) + O(0) = O(n), more precisely: n. Might not be very practical for sparse data.
Looks like in the grand scheme of things it does not matter that much unless you have a unique case that benefits from last option (array).
If you would have a ListBuffer[Int] rather than ListBuffer[(Int, Int, Int)] I would suggest to sort both collections first and then do a diff by doing a single pass through both of them at the same time. This would be O(nlog(n)). In your case a sort by _3 is not sufficient to guarantee exact order in both collections. You can sort by all three fields of a tuple but that will change the original ordering. If you are fine with that and writing your own diff then it might be the fastest option.

Tail recursion with List + .toVector or Vector?

val dimensionality = 10
val zeros = DenseVector.zeros[Double](dimensionality)
#tailrec private def specials(list: List[DenseVector[Int]], i: Int): List[DenseVector[Int]] = {
if(i >= dimensionality) list
else {
val vec = zeros.copy
vec(i to i) := 1
specials(vec :: list, i + 1)
}
}
val specialList = specials(Nil, 0).toVector
specialList.map(...doing my thing...)
Should I write my tail recursive function using a List as accumulator above and then write
specials(Nil, 0).toVector
or should I write my trail recursion with a Vector in the first place? What is computationally more efficient?
By the way: specialList is a list that contains DenseVectors where every entry is 0 with the exception of one entry, which is 1. There are as many DenseVectors as they are long.
I'm not sur what you're trying to do here but you could rewrite your code like so:
type Mat = List[Vector[Int]]
#tailrec
private def specials(mat: Mat, i: Int): Mat = i match {
case `dimensionality` => mat
case _ =>
val v = zeros.copy.updated(i,1)
specials(v :: mat, i + 1)
}
As you are dealing with a matrix, Vector is probably a better choice.
Let's compare the performance characteristics of both variants:
List: prepending takes constant time, conversion to Vector takes linear time.
Vector: prepending takes "effectively" constant time (eC), no subsequent conversion needed.
If you compare the implementations of List and Vector, then you'll find out that prepending to a List is a simpler and cheaper operation than prepending to a Vector. Instead of just adding another element at the front as it is done by List, Vector potentially has to replace a whole branch/subtree internally. On average, this still happens in constant time ("effectively" constant, because the subtrees can differ in their size), but is more expensive than prepending to List. On the plus side, you can avoid the call to toVector.
Eventually, the crucial point of interest is the size of the collection you want to create (or in other words, the amount of recursive prepend-steps you are doing). It's totally possible that there is no clear winner and one of the two variants is faster for <= n steps, whereas the other variant is faster for > n steps. In my naive toy benchmark, List/toVecor seemed to be faster for less than 8k elements, but you should perform a set of well-chosen benchmarks that represent your scenario adequately.

Upper Triangular Matrix in Scala

Is there a way I can perform a faster computation of upper triangle matrix in scala?
/** Returns a vector which consists of the upper triangular elements of a matrix */
def getUpperTriangle(A: Array[Array[Double]]) =
{
var A_ = Seq(0.)
for (i <- 0 to A.size - 1;j <- 0 to A(0).size - 1)
{
if (i <= j){
A_ = A_ ++ Seq(A(i)(j))
}
}
A_.tail.toArray
}
I don't know about faster, but this is a lot shorter and more "functional" (I note you tagged your question with functional-programming)
def getUpperTriangle(a: Array[Array[Double]]) =
(0 until a.size).flatMap(i => a(i).drop(i)).toArray
or, more or less same idea:
def getUpperTriangle(a: Array[Array[Double]]) =
a.zipWithIndex.flatMap{case(r,i) => r.drop(i)}
Here are three basic things you can do to streamline your logic to improve performance:
Start with an empty Seq, so you don't have to call Seq.tail at the end. The tail operation is going to be O(n), since the Seq factory methods give you an IndexedSeq
Use Seq.:+ to append a single element to the Seq, instead of constructing a Seq with a single element, and using Seq.++ to append two Seqs. Seq.:+ is going to be O(1) (amortized) and quite fast for an IndexedSeq. Using Seq.++ with a single-element sequence is probably still O(1), but will have a good bit more overhead.
You can start j at i instead of starting j at 0 and testing i <= j in the body of the loop. This will save n^2/2 no-op loop iterations.
Some stylistic things:
It's best to always include the return type. You actually get a deprecation warning without it.
We use lowercase for variable names in Scala
0 until size is perhaps more readable than 0 to size - 1
def getUpperTriangle(a: Array[Array[Double]]): Array[Double] = {
var result = Seq[Double]()
for (i <- 0 until a.size; j <- i until A(0).size) {
result = result :+ a(i)(j)
}
result.toArray
}

How to pick a random value from a collection in Scala

I need a method to pick uniformly a random value from a collection.
Here is my current impl.
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.toSeq(Random.nextInt(elements.size))
}
But this code instantiate a new collection, so not ideal in term of memory.
Any way to improve ?
[update] make it work with Iterator
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = {
val seq = elements.toSeq
seq(Random.nextInt(seq.size))
}
}
It may seem at first glance that you can't do this without counting the elements first, but you can!
Iterate through the sequence f and take each element fi with probability 1/i:
def choose[A](it: Iterator[A], r: util.Random): A =
it.zip(Iterator.iterate(1)(_ + 1)).reduceLeft((x, y) =>
if (r.nextInt(y._2) == 0) y else x
)._1
A quick demonstration of uniformity:
scala> ((1 to 1000000)
| .map(_ => choose("abcdef".iterator, r))
| .groupBy(identity).values.map(_.length))
res45: Iterable[Int] = List(166971, 166126, 166987, 166257, 166698, 166961)
Here's a discussion of the math I wrote a while back, though I'm afraid it's a bit unnecessarily long-winded. It also generalizes to choosing any fixed number of elements instead of just one.
Simplest way is just to think of the problem as zipping the collection with an equal-sized list of random numbers, and then just extract the maximum element. You can do this without actually realizing the zipped sequence. This does require traversing the entire iterator, though
val maxElement = s.maxBy(_=>Random.nextInt)
Or, for the implicit version
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.maxBy(_=>Random.nextInt)
}
It's possible to select an element uniformly at random from a collection, traversing it once without copying the collection.
The following algorithm will do the trick:
def choose[A](elements: TraversableOnce[A]): A = {
var x: A = null.asInstanceOf[A]
var i = 1
for (e <- elements) {
if (Random.nextDouble <= 1.0 / i) {
x = e
}
i += 1
}
x
}
The algorithm works by at each iteration makes a choice: take the new element with probability 1 / i, or keep the previous one.
To understand why the algorithm choose the element uniformly at random, consider this: Start by considering an element in the collection, for example the first one (in this example the collection only has three elements).
At iteration:
Chosen with probability: 1.
Chosen with probability:
(probability of keeping the element at previous iteration) * (keeping at current iteration)
probability => 1 * 1/2 = 1/2
Chosen with probability: 1/2 * 2/3=1/3 (in other words, uniformly)
If we take another element, for example the second one:
0 (not possible to choose the element at this iteration).
1/2.
1/2*2/3=1/3.
Finally for the third one:
0.
0.
1/3.
This shows that the algorithm selects an element uniformly at random. This can be proved formally using induction.
If the collection is large enough that you care about about instantiations, here is the constant memory solution (I assume, it contains ints' but that only matters for passing initial param to fold):
collection.fold((0, 0)) {
case ((0, _), x) => (1, x)
case ((n, x), _) if (random.nextDouble() > 1.0/n) => (n+1, x)
case ((n, _), x) => (n+1, x)
}._2
I am not sure if this requires a further explanation ... Basically, it does the same thing that #svenslaggare suggested above, but in a functional way, since this is tagged as a scala question.

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Ints frequencies, and I only care if they're at least thresh. Once it passed thresh there's no need to check anymore, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update map only when current counter value is less or equal to threshold. This will eliminate huge number of most expensive operations — map updates
try faster map instead of IntMap. If you know that values of the initial List are in fixed range, you can use Array instead of IntMap (index as the key). Another possible option will be mutable HashMap with sufficient initail capacity. As my benchmark shows it actually makes significant difference
As #ixx proposed, after incrementing value in the map, check whether it's equal to 3 and in this case add it immediately to result list. This will save you one linear traversing (appears to be not that significant for large input)
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
You could use the foldleft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
You can change your foldLeft example a bit using a mutable.Set that is build incrementally and at the same time used as filter for iterating over your Seq by using withFilter. However, because I'm using withFilteri cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilterfilters the Seqin the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aiveans microbench I can see that it is still slightly slower than his approach, but still better than the authors first approach.
Authors solution
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolbergs solution:
72
Aivean solution:
54