Tail recursion with List + .toVector or Vector? - scala

val dimensionality = 10
val zeros = DenseVector.zeros[Double](dimensionality)
@tailrec private def specials(list: List[DenseVector[Int]], i: Int): List[DenseVector[Int]] = {
  if (i >= dimensionality) list
  else {
    val vec = zeros.copy
    vec(i to i) := 1
    specials(vec :: list, i + 1)
  }
}
val specialList = specials(Nil, 0).toVector
specialList.map(...doing my thing...)
Should I write my tail-recursive function using a List as the accumulator, as above, and then call
specials(Nil, 0).toVector
or should I write my tail recursion with a Vector in the first place? Which is computationally more efficient?
By the way: specialList is a list of DenseVectors in which every entry is 0 except for a single entry, which is 1. There are as many DenseVectors as each one is long (they are the standard basis vectors).

I'm not sure what you're trying to do here, but you could rewrite your code like so:
type Mat = List[Vector[Int]]
val zeros = Vector.fill(dimensionality)(0) // a plain Vector[Int] here, not the DenseVector above

@tailrec
private def specials(mat: Mat, i: Int): Mat = i match {
  case `dimensionality` => mat
  case _ =>
    val v = zeros.updated(i, 1) // updated already returns a fresh copy
    specials(v :: mat, i + 1)
}
As you are dealing with a matrix, Vector is probably a better choice.
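For reference, the same tail recursion accumulating directly into a Vector could look something like this (a sketch under the same assumptions; +: prepends to a Vector, mirroring :: on List, and no final conversion is needed):

@tailrec
private def specials(mat: Vector[Vector[Int]], i: Int): Vector[Vector[Int]] =
  if (i >= dimensionality) mat
  else specials(zeros.updated(i, 1) +: mat, i + 1)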

Let's compare the performance characteristics of both variants:
List: prepending takes constant time, conversion to Vector takes linear time.
Vector: prepending takes "effectively" constant time (eC), no subsequent conversion needed.
If you compare the implementations of List and Vector, then you'll find out that prepending to a List is a simpler and cheaper operation than prepending to a Vector. Instead of just adding another element at the front as it is done by List, Vector potentially has to replace a whole branch/subtree internally. On average, this still happens in constant time ("effectively" constant, because the subtrees can differ in their size), but is more expensive than prepending to List. On the plus side, you can avoid the call to toVector.
Eventually, the crucial point of interest is the size of the collection you want to create (in other words, the number of recursive prepend steps you perform). It's entirely possible that there is no clear winner: one of the two variants may be faster for <= n steps, while the other is faster for > n steps. In my naive toy benchmark, List + toVector seemed to be faster for fewer than 8k elements, but you should perform a set of well-chosen benchmarks that represent your scenario adequately.
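For illustration, a naive timing harness along these lines (not a substitute for a proper JMH benchmark; the 8k figure above came from something of this shape):

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1000} µs")
  result
}

@annotation.tailrec
def viaList(acc: List[Int], i: Int, n: Int): Vector[Int] =
  if (i >= n) acc.toVector else viaList(i :: acc, i + 1, n)

@annotation.tailrec
def viaVector(acc: Vector[Int], i: Int, n: Int): Vector[Int] =
  if (i >= n) acc else viaVector(i +: acc, i + 1, n)

// Warm up the JIT first, then compare many runs of each variant:
time("List + toVector")(viaList(Nil, 0, 8000))
time("Vector")(viaVector(Vector.empty, 0, 8000))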

Related

Find duplicates in an Array in Scala

Is there a better way to find duplicates in an Array, one with better time and space complexity? Below is what I have tried.
I believe the time complexity is O(N) and the space complexity is O(1).
import scala.collection.mutable.{ArrayBuffer, HashMap}

def findDuplicates(nums: Array[Int]): ArrayBuffer[Int] = {
  val buckets = new HashMap[Int, String]()
  val outputArr = new ArrayBuffer[Int]()
  nums.foreach { x =>
    if (buckets.contains(x) && buckets(x) == "Im Cool")
      outputArr += x
    else
      buckets(x) = "Im Cool"
  }
  outputArr
}
Both the time and space complexity of your algorithm are O(N), where N = |nums|.
Time
The HashMap operations contains, put, and get all have an average time complexity of O(1), and appending to an array is also amortized O(1). Your algorithm invokes contains and get N times, and put and the array append at most N times each. This gives O(N).
Space
The size of buckets grows linearly with N: in a test case where N is twice as big, the size of buckets will be approximately twice as big, too. The same holds for outputArr. So this gives O(N), too.
Optimization
Your approach is optimal in terms of the theoretical complexity. Because duplicate elements can be anywhere in the input array, you must read every element, unless you have some prior knowledge about the array. So the time complexity cannot be less than O(N).
The output array may contain up to N-1 elements (example: [0, 0, 0] returns [0, 0]), therefore the space complexity cannot be less than O(N).
However, you can optimize your implementation, both in terms of actual speed and readability, by using a HashSet to store the elements you have already seen.
import scala.collection.mutable.{ArrayBuffer, HashSet}

def findDuplicates(nums: Array[Int]): ArrayBuffer[Int] = {
  val buckets = new HashSet[Int]()
  val outputArr = new ArrayBuffer[Int]()
  nums.foreach { x =>
    if (buckets.contains(x))
      outputArr += x
    else
      buckets.add(x)
  }
  outputArr
}
This removes the magic "Im Cool" strings and saves the constant time of string comparisons.
As you've discovered, it is possible to write C code in the Scala language, but it's not a good way to learn Scala style.
Adhering to FP principles can, sometimes, make it even more difficult to solve LeetCode challenges.
But Scala can be a pretty good choice when playing code golf.
def findDuplicates(nums: Array[Int]): Array[Int] =
  nums diff nums.distinct
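For example (diff is a multiset difference: it removes one occurrence per element of its argument, so every extra occurrence survives):

val nums = Array(4, 3, 2, 7, 8, 2, 3, 1)
nums diff nums.distinct // Array(2, 3): distinct kept one copy of each, diff removed exactly those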
A simple and elegant solution would be this:
def getDuplicates[T](data: List[T]): List[T] =
data
.groupMapReduce(identity)(_ => 1)(_ + _)
.iterator
.collect {
case (x, count) if (count > 1) => x
}.toList
The time complexity is O(N), since it makes two traversals of the data: a first one to compute the number of times each element is present, and a second one to filter and keep the elements that appear at least twice.
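Note that groupMapReduce requires Scala 2.13 or later. A quick illustrative usage (the result order is unspecified, since it passes through a Map):

getDuplicates(List(1, 2, 3, 2, 1, 4)) // List(1, 2), in some order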
BTW, I would stay away from Arrays unless you really need them.
Arrays are best avoided because they are mutable and invariant; they are also not proper collections but a JVM primitive, and using them effectively is complex.
def getDuplicates[T](nums: Array[T]): List[T] =
  nums
    .foldLeft(Map.empty[T, Int])((acc, x) => acc.updated(x, acc.getOrElse(x, 0) + 1))
    .filter(_._2 > 1)
    .flatMap { case (x, count) => List.fill(count - 1)(x) }
    .toList
Time complexity: O(N)
Space complexity: O(N) in the worst case; the intermediate Map grows with the number of distinct elements, whether or not the output list is counted.
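Note that, unlike the solutions above, this variant emits one copy per extra occurrence. An illustrative example (result order unspecified):

getDuplicates(Array(1, 2, 3, 2, 1, 1)) // List(1, 1, 2) in some order: 1 occurs three times, 2 twice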

Scala: Sort a ListBuffer of (Int, Int, Int) by the third value most efficiently

I am looking to sort a ListBuffer[(Int, Int, Int)] by the third value as efficiently as possible. This is what I currently have, using sortBy. Note that y and z are both ListBuffer[(Int, Int, Int)], and I take their difference first. My goal is to optimize this operation (taking the difference between the two lists and sorting by the third element). I am assuming the diff part cannot be optimized but the sortBy can, so I am looking for an efficient way to do the sorting part. I found posts on sorting Arrays, but I am working with a ListBuffer, and converting to an Array adds overhead, so I would rather not convert it.
val x = (y diff z).sortBy(i => i._3)
1) If you want to use Scala libraries then you can't do much better than that. Scala already tries to sort your collection in the most efficient way possible.
SeqLike defines def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f) which calls this implementation:
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
val len = this.length
val arr = new ArraySeq[A](len)
var i = 0
for (x <- this.seq) {
arr(i) = x
i += 1
}
java.util.Arrays.sort(arr.array, ord.asInstanceOf[Ordering[Object]])
val b = newBuilder
b.sizeHint(len)
for (x <- arr) b += x
b.result
}
This is what your code will be calling. As you can see it already uses arrays to sort data in place. According to the javadoc of public static void sort(Object[] a):
Implementation note: This implementation is a stable, adaptive,
iterative mergesort that requires far fewer than n lg(n) comparisons
when the input array is partially sorted, while offering the
performance of a traditional mergesort when the input array is
randomly ordered. If the input array is nearly sorted, the
implementation requires approximately n comparisons.
2) If you try to optimize by inserting results of your diff directly into a sorted structure like a binary tree as you produce them element by element, you'll still be paying the same price: average cost of insertion is log(n) times n elements = n log(n) - same as any fast sorting algorithm like merge sort.
3) Thus you can't optimize this case generically; you can only optimize for your particular use-case.
3a) For instance, ListBuffer might be replaced with a Set and diff should be much faster. In fact it's implemented as:
def diff(that: GenSet[A]): This = this -- that
which uses -, which in turn should be faster than diff on Seq, since the latter has to build a map of occurrence counts first:
def diff[B >: A](that: GenSeq[B]): Repr = {
val occ = occCounts(that.seq)
val b = newBuilder
for (x <- this)
if (occ(x) == 0) b += x
else occ(x) -= 1
b.result
}
3b) You can also avoid sorting altogether by using _3 as an index into an array: if you insert each tuple at that index, the array is sorted by construction. This only works if your data is dense enough, or if you are happy to deal with a sparse array afterwards. Multiple values may also map to the same index, and you will have to deal with that as well. Effectively you are building a sorted map. You could use a Map for that too, but a HashMap won't be sorted and a TreeMap again costs log(n) per add operation.
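For instance, a sketch of option 3b (assuming the _3 values are non-negative and bounded by some maxKey, and collecting collisions in per-index buckets; note this is not a stable sort as written, since each bucket is built by prepending):

// Drop each tuple into the bucket given by its third component; reading the
// buckets back in index order yields the data ordered by _3, with no sort call.
def bucketByThird(xs: Iterable[(Int, Int, Int)], maxKey: Int): Seq[(Int, Int, Int)] = {
  val buckets = Array.fill(maxKey + 1)(List.empty[(Int, Int, Int)])
  for (t <- xs) buckets(t._3) = t :: buckets(t._3)
  buckets.toSeq.flatten
}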
Consult Scala Collections Performance Characteristics to understand what you can gain based on your case.
4) Anyhow, sort is really fast on modern computers. Do some benchmarking to make sure you are not prematurely optimizing it.
To summarize complexity for different scenarios...
Your current case:
diff for SeqLike: n to create a map from that + n to iterate over this (map lookup is effectively constant time (C)) = 2n or O(n)
sort - O(n log(n))
total = O(n) + O(n log(n)) = O(n log(n)), more precisely: 2n + n log(n)
If you use Set instead of SeqLike:
diff for Set: n to iterate (lookup is C) = O(n)
sort - same
total - same: O(n) + O(n log(n)) = O(n log(n)), more precisely: n + n log(n)
If you use Set and array to insert:
diff - same as for Set
sort - 0 - array is sorted by construction
total: O(n) + O(0) = O(n), more precisely: n. Might not be very practical for sparse data.
Looks like in the grand scheme of things it does not matter that much unless you have a unique case that benefits from last option (array).
If you had a ListBuffer[Int] rather than a ListBuffer[(Int, Int, Int)], I would suggest sorting both collections first and then computing the diff in a single simultaneous pass through both of them. This would be O(n log(n)) overall. In your case, sorting by _3 alone is not sufficient to guarantee the same order in both collections. You could sort by all three fields of the tuple, but that changes the original ordering. If you are fine with that and with writing your own diff, it might be the fastest option.
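A sketch of that last idea for the hypothetical ListBuffer[Int] case (sort both, then walk them together in a merge-style multiset difference; not tail-recursive as written, so very large inputs would want an accumulator):

// Multiset difference of two sorted lists in a single pass.
// After the two O(n log(n)) sorts, this walk is O(n).
def sortedDiff(ys: List[Int], zs: List[Int]): List[Int] = (ys, zs) match {
  case (Nil, _) => Nil
  case (_, Nil) => ys
  case (yh :: yt, zh :: zt) =>
    if (yh < zh) yh :: sortedDiff(yt, zs) // yh cannot occur in zs anymore
    else if (yh > zh) sortedDiff(ys, zt)  // skip the smaller zh
    else sortedDiff(yt, zt)               // matched: drop one occurrence of each
}

val x = sortedDiff(y.toList.sorted, z.toList.sorted) // the result is already sorted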

Upper Triangular Matrix in Scala

Is there a way I can perform a faster computation of the upper triangular matrix in Scala?
/** Returns a vector which consists of the upper triangular elements of a matrix */
def getUpperTriangle(A: Array[Array[Double]]) = {
  var A_ = Seq(0.)
  for (i <- 0 to A.size - 1; j <- 0 to A(0).size - 1) {
    if (i <= j) {
      A_ = A_ ++ Seq(A(i)(j))
    }
  }
  A_.tail.toArray
}
I don't know about faster, but this is a lot shorter and more "functional" (I note you tagged your question with functional-programming)
def getUpperTriangle(a: Array[Array[Double]]) =
(0 until a.size).flatMap(i => a(i).drop(i)).toArray
or, more or less same idea:
def getUpperTriangle(a: Array[Array[Double]]) =
a.zipWithIndex.flatMap{case(r,i) => r.drop(i)}
Here are three basic things you can do to streamline your logic to improve performance:
Start with an empty Seq, so you don't have to call Seq.tail at the end. The tail operation is going to be O(n), since the Seq factory methods give you an IndexedSeq
Use Seq.:+ to append a single element to the Seq, instead of constructing a Seq with a single element, and using Seq.++ to append two Seqs. Seq.:+ is going to be O(1) (amortized) and quite fast for an IndexedSeq. Using Seq.++ with a single-element sequence is probably still O(1), but will have a good bit more overhead.
You can start j at i instead of starting j at 0 and testing i <= j in the body of the loop. This will save n^2/2 no-op loop iterations.
Some stylistic things:
It's best to always include the return type. You actually get a deprecation warning without it.
We use lowercase for variable names in Scala
0 until size is perhaps more readable than 0 to size - 1
def getUpperTriangle(a: Array[Array[Double]]): Array[Double] = {
  var result = Seq[Double]()
  for (i <- 0 until a.size; j <- i until a(0).size) {
    result = result :+ a(i)(j)
  }
  result.toArray
}

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large is hard to say, but in any case this seems wasteful, as I don't care about each Int's exact frequency; I only care whether it reaches thresh. Once it has passed thresh, there's no need to keep counting: just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial list at least once. Your current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update the map only when the current counter value is less than or equal to the threshold. This eliminates a huge number of the most expensive operations: map updates
try a faster map instead of IntMap. If you know that the values of the initial List lie in a fixed range, you can use an Array instead of IntMap (with the value as the index). Another possible option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, this actually makes a significant difference
as @ixx proposed, after incrementing a value in the map, check whether it equals thresh and, in that case, add the element immediately to the result list. This saves one extra linear traversal (which appears not to be that significant for large input)
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created a microbenchmark to measure the actual performance of the different implementations. For sufficiently large input and output, Ixx's suggestion of immediately adding elements to the result list doesn't produce a significant improvement. However, a similar approach can be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Author's solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (if (count <= thresh) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach { i =>
  val c = map.getOrElse(i, 0) + 1
  if (c == thresh) {
    res += i
  }
  if (c <= thresh) {
    map(i) = c
  }
}
The full microbenchmark source is available here.
You could use foldLeft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
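For illustration only, a hand-rolled count-min sketch has roughly this shape (Algebird's implementation is the production-quality one; the class, sizes, and hashing below are invented for the example). Estimates can only overcount, never undercount, so you get false positives but no false negatives:

import scala.util.hashing.MurmurHash3

class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by mixing the row index into the hash.
  private def bucket(x: Int, row: Int): Int =
    (MurmurHash3.productHash((x, row)) & Int.MaxValue) % width

  def add(x: Int): Unit =
    for (row <- 0 until depth) table(row)(bucket(x, row)) += 1

  // The minimum over all rows is the tightest upper bound on the true count.
  def estimate(x: Int): Long =
    (0 until depth).map(row => table(row)(bucket(x, row))).min
}

val cms = new CountMinSketch(width = 4096, depth = 5)
myList.foreach(cms.add)
val approx = myList.filter(x => cms.estimate(x) >= thresh).toSet // may contain false positives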
You can change your foldLeft example a bit by using a mutable.Set that is built incrementally and at the same time used as a filter for iterating over your Seq, via withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilter filters the Seq inside the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach.
Author's solution:
377 ms
Ixx solution:
399 ms
Ixx solution2 (eliminated excessive map writes):
110 ms
Sascha Kolberg's solution:
72 ms
Aivean solution:
54 ms

Combination of elements

Problem:
Given a Seq seq and an Int n.
I basically want all combinations of the elements up to size n. The arrangement matters, meaning e.g. [1,2] is different from [2,1].
def combinations[T](seq: Seq[T], size: Int) = ...
Example:
combinations(List(1,2,3), 0)
//Seq(Seq())
combinations(List(1,2,3), 1)
//Seq(Seq(), Seq(1), Seq(2), Seq(3))
combinations(List(1,2,3), 2)
//Seq(Seq(), Seq(1), Seq(2), Seq(3), Seq(1,2), Seq(2,1), Seq(1,3), Seq(3,1),
//Seq(2,3), Seq(3,2))
...
What I have so far:
def combinations[T](seq: Seq[T], size: Int) = {
@tailrec
def inner(seq: Seq[T], soFar: Seq[Seq[T]]): Seq[Seq[T]] = seq match {
case head +: tail => inner(tail, soFar ++ {
val insertList = Seq(head)
for {
comb <- soFar
if comb.size < size
index <- 0 to comb.size
} yield comb.patch(index, insertList, 0)
})
case _ => soFar
}
inner(seq, IndexedSeq(IndexedSeq.empty))
}
What would be your approach to this problem? This method will be called a lot, so it should be as efficient as possible.
There are methods in the library like subsets or combinations (yes, I chose the same name) that return iterators. I also thought about that, but I have no idea yet how to design this lazily.
Not sure if this is efficient enough for your purpose, but it's the simplest approach.
def combinations[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  // 0 to size, so that the empty sequence from the expected output is included
  (0 to size).flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
Edit: to make it lazy, you can use a view:
def combinations[T](seq: Seq[T], size: Int): Iterable[Seq[T]] = {
  (0 to size).view.flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
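A quick sanity check against the example from the question (with 0 to size as above, the empty sequence is included):

combinations(List(1, 2, 3), 2).foreach(println)
// List(), List(1), List(2), List(3), List(1, 2), List(2, 1),
// List(1, 3), List(3, 1), List(2, 3), List(3, 2)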
From permutations theory we know that the number of permutations of K elements taken from a set of N elements is
N! / (N - K)!
(see http://en.wikipedia.org/wiki/Permutation)
Therefore, if you want to build them all, you will have
algorithm complexity = number of permutations * cost of building each permutation
The potential optimization of the algorithm lies in minimizing the cost of building each permutation, by using a data structure with an append or prepend operation that runs in O(1).
You are using an IndexedSeq, which is a collection optimized for O(1) random access. When collections are optimized for random access they are backed by arrays. When using such collections (this is also valid for java ArrayList) you give up the guarantee of a low cost insertion operation: sometimes the array won't be big enough and the collection will have to create a new one and copy all the elements.
When using instead linked lists (such as scala List, which is the default implementation for Seq) you take the opposite choice: you give up constant time access for constant time insertion. In particular, scala List is a recursive data structure with constant time insertion at the front.
So if you are looking for high performance and you need the collection to be available eagerly, use Seq.empty instead of IndexedSeq.empty and, at each iteration, prepend the new element at the head of the Seq. If you need something lazy, use a Stream, which will minimize memory occupation. An additional strategy worth exploring is to keep the IndexedSeq but avoid IndexedSeq.empty: use the builder instead, and try to provide an array of the right size (N! / (N - K)!).
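To make that advice concrete, here is a sketch of the code from the question rewritten to accumulate into a List and prepend each new batch (untested; the point is that added ::: soFar costs O(|added|), whereas soFar ++ added copies the ever-growing soFar on every step):

def combinations[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  @annotation.tailrec
  def inner(rest: Seq[T], soFar: List[Seq[T]]): List[Seq[T]] = rest match {
    case head +: tail =>
      val added = for {
        comb <- soFar
        if comb.size < size
        index <- 0 to comb.size
      } yield comb.patch(index, Seq(head), 0)
      inner(tail, added ::: soFar) // prepend the batch; soFar itself is never copied
    case _ => soFar
  }
  inner(seq, List(Nil))
}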