Seq with maximal elements - scala

I have a Seq and a function Int => Int. What I need to achieve is to take from the original Seq only those elements that map to the maximum of the resulting sequence (the one I'll have after applying the given function):
def mapper: Int => Int = x => x * x
val s = Seq(-2, -2, 2, 2)
val themax = s.map(mapper).max
s.filter(mapper(_) == themax)
But this seems wasteful, since it has to map twice (once to find the maximum, and again inside the filter).
Is there a better way to do this? (hopefully without an explicit loop)
EDIT
The code has since been edited; in the original, the filter line was `s.filter( mapper(_)==s.map(mapper).max)`. As om-nom-nom pointed out, this evaluates `s.map(mapper).max` on each filter iteration, leading to quadratic complexity.

Here is a solution that does the mapping only once, using the `foldLeft` function:
The principle is to traverse the seq and map each element as we go: if the mapped value is greater than every mapped value seen so far, start a new sequence with it; if it is equal, append it to the accumulated sequence of maximums; if it is less, keep the previously accumulated sequence unchanged.
def getMaxElems1(s: Seq[Int])(mapper: Int => Int): Seq[Int] =
  s.foldLeft(Seq[(Int, Int)]()) { (res, elem) =>
    val e2 = mapper(elem)
    if (res.isEmpty || e2 > res.head._2)
      Seq((elem, e2))        // new maximum: start a fresh sequence
    else if (e2 == res.head._2)
      res ++ Seq((elem, e2)) // ties the maximum: append
    else
      res                    // below the maximum: keep what we have
  }.map(_._1)                // keep only the original elements
// test with your list
scala> getMaxElems1(s)(mapper)
res14: Seq[Int] = List(-2, -2, 2, 2)
// test with a list that also contains non-maximal elements
scala> getMaxElems1(Seq(-1, 2,0, -2, 1,-2))(mapper)
res15: Seq[Int] = List(2, -2, -2)
Remark: About complexity
The algorithm I present above has a complexity of O(N) for a list with N elements. However, the map-zip-filter approach is also O(N):
- mapping all elements is O(N)
- computing the max is O(N)
- zipping is O(N)
- filtering the list against the max is O(N)
- recovering the original elements (`unzip._1`) is O(M), with M the number of retained elements
So, finally, the algorithm you presented in your question has the same complexity as the one in my answer, and moreover your solution is clearer than mine. So, even though `foldLeft` is more powerful, for this operation I would recommend your idea, but zipping with the original list and computing the map only once (especially if your mapper is more complicated than a simple square). Here is that solution, worked out with the help of *scala_newbie* in question/chat/comments.
def getMaxElems2(s: Seq[Int])(mapper: Int => Int): Seq[Int] = {
  val mappedS = s.map(mapper) // map done only once
  val m = mappedS.max         // find the max
  s.zip(mappedS).filter(_._2 == m).unzip._1 // keep originals whose image is the max
}
// test with your list
scala> getMaxElems2(s)(mapper)
res16: Seq[Int] = List(-2, -2, 2, 2)
// test with a list that also contains non-maximal elements
scala> getMaxElems2(Seq(-1, 2,0, -2, 1,-2))(mapper)
res17: Seq[Int] = List(2, -2, -2)

Related

Subsequence in a sequence of numbers in scala

For example, I have:
List(1,3,2,5)
How do I get all of these:
List(1), List(3), List(2), List(5), List(1,3), List(1,2), List(1,5), List(3,2), List(3,5), List(2,5), List(1,3,2), List(1,3,5), List(1,2,5), List(3,2,5), List(1,3,2,5)
I need this for the Longest Increasing Subsequence problem, and the answer for this example would be (1,3,5).
And I want to use it for bigger Lists.
You want all the combinations() of lengths 1 to the array length.
val arr = Array(4, 3, 1)
arr.indices.flatMap(x => arr.combinations(x + 1))
//res0: Seq[Array[Int]] = Vector(Array(4), Array(3), Array(1), Array(4, 3), Array(4, 1), Array(3, 1), Array(4, 3, 1))
update
This will give you all possible sub-sequences, retaining original order and duplicate elements.
def subseqs[A](seq: Seq[A]): List[Seq[A]] = seq match {
  case hd +: tl =>
    val res = subseqs(tl)
    Seq(hd) :: res ++ res.map(hd +: _)
  case Seq() => Nil
}
The result is a List of 2^n - 1 possible sub-sequences. So for a collection of 8 elements you'll get 255 sub-sequences.
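For instance, a quick check (my example, not part of the original answer):
subseqs(Seq(1, 2, 3))
// List(List(1), List(2), List(3), List(2, 3), List(1, 2), List(1, 3), List(1, 2, 3))
// 7 sub-sequences, i.e. 2^3 - 1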
This, of course, is going to be way too tedious and inefficient for your purposes. Generating all possible sub-sequences in order to find the Longest Increasing is a little like washing all the clothes in your neighborhood so you'll have clean socks in the morning.
Much better to generate only the increasing sub-sequences and find the longest from that set (9 lines of code).
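The original nine lines aren't shown here, but a minimal sketch of that idea might look like the following (my code, assuming Int elements): each element starts a new subsequence of its own and extends every previously collected increasing subsequence that ends below it.
// Collect only the increasing sub-sequences (kept reversed for cheap
// prepending), then pick the longest. Still exponential in the worst case,
// but typically far smaller than the full 2^n - 1 power set.
def incSubseqs(seq: Seq[Int]): List[List[Int]] =
  seq.foldLeft(List.empty[List[Int]]) { (acc, x) =>
    val extended = acc.filter(s => s.head < x).map(x :: _)
    List(x) :: extended ::: acc
  }.map(_.reverse)

def longestIncreasing(seq: Seq[Int]): List[Int] =
  incSubseqs(seq).maxBy(_.length)

longestIncreasing(List(1, 3, 2, 5)) // List(1, 2, 5), tied in length with List(1, 3, 5)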
[ Update in response to updated question ]
[ Thanks to #jwvh for spotting error in original version ]
This method will generate all possible sub-sequences of a List:
def subsequences[T](list: List[T]): List[List[T]] =
  list match {
    case Nil =>
      List(List())
    case hd :: tl =>
      val subs = subsequences(tl)
      subs.map(hd +: _) ++ subs
  }
Note that this is not efficient for a number of reasons, so it is not a good way to solve the "longest increasing subsequence problem".
[ Original answer ]
This function will generate all the non-empty contiguous sub-sequences of any sequence:
def subsequences[T](seq: Seq[T]) =
  seq.tails.flatMap(_.inits).filter(_.nonEmpty)
This returns an Iterator, so it creates each sub-sequence in turn, which reduces memory usage.
Note that this will generate all the sub-sequences and will preserve the order of the values, unlike solutions using combinations or Set.
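As a concrete illustration (my example), for a three-element input it produces each contiguous slice exactly once:
subsequences(Seq(1, 2, 3)).toList
// List(List(1, 2, 3), List(1, 2), List(1), List(2, 3), List(2), List(3))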
You can use this in your "longest increasing subsequence problem" like this:
def isAscending(seq: Seq[Int]): Boolean =
  seq.length <= 1 || seq.sliding(2).forall(x => x(0) < x(1))
subsequences(a).filter(isAscending).maxBy(_.length)
The result will be the longest contiguous run of ascending values in the input a (since these sub-sequences are contiguous, this finds the longest ascending run rather than the general Longest Increasing Subsequence).

Functional Programming way to calculate something like a rolling sum

Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i.e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.
You may use scanLeft:
list.scanLeft(0)(_ + _).tail // drop the leading 0 (the initial accumulator value)
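With the question's list, for instance:
val list = List(4, 12, 3, 6, 9)
list.scanLeft(0)(_ + _).tail // List(4, 16, 19, 25, 34)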
The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length
  // Phase 1: sum every partition except the last, then prefix-scan those
  // totals so partitionCumSums(i) is the sum of everything before partition i.
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
      if (index == nPartitions - 1) Iterator.empty
      else Iterator.single(iter.foldLeft(num.zero)(num.plus))
    ).collect
    .scanLeft(num.zero)(num.plus)
  // Phase 2: running sum within each partition, seeded by its offset.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid). It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.
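As an illustration, here is a sketch of that generalization (my code, reusing the imports above; the name cumFold is hypothetical): replace Numeric with an explicit zero element and an associative plus.
// `zero` must be an identity for the associative `plus`, otherwise the
// partition stitching below produces wrong results.
def cumFold[N: ClassTag](rdd: RDD[N])(zero: N)(plus: (N, N) => N): RDD[N] = {
  val nPartitions = rdd.partitions.length
  val offsets = rdd.mapPartitionsWithIndex((index, iter) =>
      if (index == nPartitions - 1) Iterator.empty
      else Iterator.single(iter.foldLeft(zero)(plus))
    ).collect
    .scanLeft(zero)(plus)
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = plus(offsets(index), iter.next)
      iter.scanLeft(start)(plus)
    }
  )
}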
I don't know what functionalities are supported by Spark RDDs, so I am not sure if this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know by a comment and I will delete my answer):
list.zipWithIndex.map{x => list.take(x._2+1).sum}
This code works for me; it sums up the elements. It takes the index of each list element and sums the corresponding first elements of the list (notice the +1, since zipWithIndex starts with 0). Note that the per-element take-and-sum makes this O(N²) overall.
When printing it, I get the following:
List(4, 16, 19, 25, 34)

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large is hard to say, but in any case this seems wasteful, as I don't care about each Int's frequency; I only care whether it reaches thresh. Once it's passed thresh there's no need to keep checking, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some hashing trick or Bloom filters, where the Set[Int] might include some false positives, or where "the frequency of an Int > thresh" isn't really a Boolean but a Double in [0, 1].
First of all, you can't do better than O(N), as you need to check each element of your initial list at least once. Your current approach is O(N), presuming that the map operations are effectively constant-time.
Now what you can try in order to increase efficiency:
- update the map only when the current counter value is less than or equal to the threshold. This eliminates a huge number of the most expensive operations: map updates.
- try a faster map. If you know that the values of the initial List fall in a fixed range, you can use an Array instead of a Map (with the value as the index). Another option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, this actually makes a significant difference.
- as #ixx proposed, after incrementing a value in the map, check whether it equals thresh and, if so, add the element immediately to the result list. This saves one extra linear traversal (it appears not to be that significant for large inputs).
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created a microbenchmark to measure the actual performance of the different implementations. For sufficiently large input and output, Ixx's suggestion of immediately adding elements to the result list doesn't produce a significant improvement. However, a similar approach can be used to eliminate unnecessary map updates (which appear to be the most expensive operation).
Results of the benchmarks (avg run times on 1000000 elems, with pre-warming):
- Author's solution: 447 ms
- Ixx's solution: 412 ms
- Ixx's solution 2 (eliminated excessive map writes): 150 ms
- My solution: 57 ms
My solution uses a mutable HashMap instead of the immutable map and includes all the other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (if (count <= thresh) m + (i -> count) else m, // stop updating past thresh
     if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach { i =>
  val c = map.getOrElse(i, 0) + 1
  if (c == thresh) {
    res += i
  }
  if (c <= thresh) {
    map(i) = c // stop writing to the map once the threshold is passed
  }
}
The full microbenchmark source is available here.
You could use foldLeft to collect the matching items, like this:
val tuple = (Map[Int, Int](), List[Int]())
myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (m + (i -> count), if (count == thresh) i :: s else s)
}
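Since the fold yields a (Map, List) pair, the Set[Int] the question asked for is the second component; for example (my addition, not part of the original answer):
val (counts, matched) = myList.foldLeft(tuple) {
  case ((m, s), i) =>
    val count = m.getOrElse(i, 0) + 1
    (m + (i -> count), if (count == thresh) i :: s else s)
}
matched.toSet // Set(1, 2) for myList = List(1,2,3,2,1,4,3,2,1) and thresh = 3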
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
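To illustrate the idea without pulling in the library, here is a minimal hand-rolled sketch (my code; Algebird's implementation is far more careful about hashing and error bounds):
import scala.util.hashing.MurmurHash3

// Count-Min Sketch in miniature: `depth` hash rows of `width` counters each.
// estimate() never underestimates the true count, so testing it against the
// threshold can produce false positives but never false negatives.
class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(x: Int, row: Int): Int =
    (MurmurHash3.productHash((x, row)) & Int.MaxValue) % width

  def add(x: Int): Unit =
    (0 until depth).foreach(row => table(row)(bucket(x, row)) += 1)

  def estimate(x: Int): Long =
    (0 until depth).map(row => table(row)(bucket(x, row))).min
}

// Usage sketch against the question's threshold:
val cms = new CountMinSketch(width = 1024, depth = 5)
myList.foreach(cms.add)
val approx = myList.filter(cms.estimate(_) >= thresh).toSet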
You can change your foldLeft example a bit, using a mutable.Set that is built incrementally and at the same time used as a filter while iterating over your Seq, by using withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
  val counts: mutable.Map[A, Int] = mutable.Map.empty
  val result: mutable.Set[A] = mutable.Set.empty
  in.withFilter(!result(_)).foreach { x =>
    counts.update(x, counts.getOrElse(x, 0) + 1)
    if (counts(x) >= threshold) {
      result += x
    }
  }
  result.toSet
}
So, this discards items that have already been added to the result set while running through the Seq the first time, because withFilter filters the Seq inside the applied function (map, flatMap, foreach) rather than returning a filtered Seq.
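For the question's example list this gives the expected result:
val myList = List(1, 2, 3, 2, 1, 4, 3, 2, 1)
getItems(myList, 3) // Set(1, 2)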
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach.
- Author's solution: 377 ms
- Ixx's solution: 399 ms
- Ixx's solution 2 (eliminated excessive map writes): 110 ms
- Sascha Kolberg's solution: 72 ms
- Aivean's solution: 54 ms

Constructing BitSets in Scala from a predicate?

Suppose I want to construct a BitSet containing all integers from 0 until n satisfying some predicate f: Int => Boolean.
I could write something like
BitSet((0 until n):_*).filter(f)
which of course works. But it feels rather inefficient! I'm planning on doing this inside a pretty tight loop, and would like suggestions for more efficient ways.
This is the best I could come up with at the moment:
BitSet((0 until n).view.filter(f):_*)
The view part makes the filter method lazy. This makes sure that, when the BitSet is created from the given sequence, it filters on the fly. Your original suggestion creates a second BitSet after the first one is built.
If performance is truly your major concern, the best option is probably to use a mutable.BitSet and a while loop, and then call toImmutable on the result.
val bitSet = {
  val tmp = new scala.collection.mutable.BitSet(n)
  var i = 0
  while (i < n) {
    if (f(i)) {
      tmp += i
    }
    i += 1
  }
  tmp.toImmutable
}
I think the most efficient "functional" way is to use foldLeft:
(0 until n).foldLeft(BitSet())((s, i) => if (f(i)) s + i else s)
It doesn't create an intermediate collection but constructs the collection from scratch while filtering.
The first thing I thought of was to use breakOut, but it doesn't work for filter:
scala> val set: BitSet = (0 until 10).filter(f)(collection.breakOut)
<console>:11: error: polymorphic expression cannot be instantiated to expected type;
found : [From, T, To]scala.collection.generic.CanBuildFrom[From,T,To]
required: Int
val set: BitSet = (0 until 10).filter(f)(collection.breakOut)
^
scala> val set: BitSet = (0 until 10).map(_+1)(collection.breakOut)
set: scala.collection.immutable.BitSet = BitSet(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
breakOut doesn't create an intermediate collection either, but because filter doesn't have a second parameter list, it can't work.
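If I remember the 2.12-era collection signatures correctly, one workaround (my suggestion, not from the original answer) is to phrase the filter as a collect, which does take a CanBuildFrom:
// collect has an implicit CanBuildFrom parameter, so breakOut fits here
// and the BitSet is built directly, with no intermediate collection.
val set: BitSet = (0 until n).collect { case i if f(i) => i }(collection.breakOut)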

Scala: what is the most appropriate data structure for sorted subsets?

Given a large collection (let's call it 'a') of elements of type T (say, a Vector or List) and an evaluation function 'f' (say, (T) => Double) I would like to derive from 'a' a result collection 'b' that contains the N elements of 'a' that result in the highest value under f. The collection 'a' may contain duplicates. It is not sorted.
Maybe leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection 'b'? Thanks for any pointers / ideas.
Notes:
(1) I guess my use case can be most concisely expressed as
val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example
val f : (Int)=>Double = (n)=>n // evaluation function
val b = a.sortBy( f ).takeRight( N ) // sort ascending, then keep the N best
except that I do not want to sort the entire set.
(2) one option might be an iteration over 'a' that fills a TreeSet with 'manual' size bounding (reject anything worse than the worst item in the set, don't let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.
(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?
You can use a priority queue:
def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = {
  val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse)
  val (before, after) = xs.splitAt(k)
  q ++= before
  after.foreach(x => q += ord.max(x, q.dequeue)) // swap out the current minimum if beaten
  q.dequeueAll
}
We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates:
scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)
res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)
And it doesn't sort the complete list. I've got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.
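For example, the adaptation could be as small as wrapping the evaluation function in Ordering.by (my sketch; the name topKBy is hypothetical):
// Top k elements of xs under evaluation function f, duplicates retained.
def topKBy[A, B: Ordering](xs: Seq[A], k: Int)(f: A => B): Seq[A] =
  firstK(xs, k)(Ordering.by(f))

topKBy(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)((n: Int) => n.toDouble)
// ArrayBuffer(6, 7, 9, 9)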