How to pick a random value from a collection in Scala

I need a method to pick a random value uniformly from a collection.
Here is my current implementation.
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.toSeq(Random.nextInt(elements.size))
}
But this code instantiates a new collection, so it's not ideal in terms of memory.
Any way to improve?
[update] make it work with Iterator
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = {
val seq = elements.toSeq
seq(Random.nextInt(seq.size))
}
}

It may seem at first glance that you can't do this without counting the elements first, but you can!
Iterate through the sequence f and take each element f_i with probability 1/i:
def choose[A](it: Iterator[A], r: util.Random): A =
it.zip(Iterator.iterate(1)(_ + 1)).reduceLeft((x, y) =>
if (r.nextInt(y._2) == 0) y else x
)._1
A quick demonstration of uniformity:
scala> ((1 to 1000000)
| .map(_ => choose("abcdef".iterator, r))
| .groupBy(identity).values.map(_.length))
res45: Iterable[Int] = List(166971, 166126, 166987, 166257, 166698, 166961)
Here's a discussion of the math I wrote a while back, though I'm afraid it's a bit unnecessarily long-winded. It also generalizes to choosing any fixed number of elements instead of just one.

The simplest way is to think of the problem as zipping the collection with an equal-sized list of random numbers and then extracting the element paired with the maximum random number. You can do this without actually realizing the zipped sequence. This does require traversing the entire iterator, though:
val maxElement = s.maxBy(_=>Random.nextInt)
Or, for the implicit version
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
def pickRandomly : A = elements.maxBy(_=>Random.nextInt)
}

It's possible to select an element uniformly at random from a collection, traversing it once without copying the collection.
The following algorithm will do the trick:
def choose[A](elements: TraversableOnce[A]): A = {
var x: A = null.asInstanceOf[A]
var i = 1
for (e <- elements) {
if (Random.nextDouble <= 1.0 / i) {
x = e
}
i += 1
}
x
}
The algorithm works by making a choice at each iteration: take the new element with probability 1/i, or keep the previous one.
To understand why the algorithm chooses the element uniformly at random, consider this: start with some element of the collection, for example the first one (in this example the collection has only three elements).
At each iteration it is chosen with probability:
1. 1.
2. (probability of keeping the element at the previous iteration) * (probability of keeping it at the current iteration) = 1 * 1/2 = 1/2.
3. 1/2 * 2/3 = 1/3 (in other words, uniformly).
If we take another element, for example the second one:
1. 0 (it is not possible to choose this element at the first iteration).
2. 1/2.
3. 1/2 * 2/3 = 1/3.
Finally, for the third one:
1. 0.
2. 0.
3. 1/3.
This shows that the algorithm selects an element uniformly at random. This can be proved formally using induction.
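For the general case of n elements the same argument telescopes (a sketch of the induction step, not part of the original answer): element j is finally selected iff it is taken at step j and then kept at every later step, so
P(element j selected) = 1/j * (1 - 1/(j+1)) * (1 - 1/(j+2)) * ... * (1 - 1/n)
= 1/j * j/(j+1) * (j+1)/(j+2) * ... * (n-1)/n
= 1/n
so every element is selected with the same probability 1/n.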

If the collection is large enough that you care about instantiations, here is a constant-memory solution (I assume it contains Ints, but that only matters for passing the initial param to foldLeft):
collection.foldLeft((0, 0)) {
case ((0, _), x) => (1, x) // first element: always take it
case ((n, x), _) if (Random.nextDouble() > 1.0 / (n + 1)) => (n + 1, x) // keep the current pick with probability n/(n+1)
case ((n, _), x) => (n + 1, x) // otherwise take the new element (probability 1/(n+1))
}._2
I am not sure if this requires further explanation ... Basically, it does the same thing that @svenslaggare suggested above, but in a functional way, since this is tagged as a Scala question.

Related

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large is hard to say, but in any case this seems wasteful, as I don't care about each Int's exact frequency; I only care whether it reaches thresh. Once it has passed thresh there's no need to check any more, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial collection at least once. Your current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
- Update the map only when the current counter value is less than or equal to the threshold. This will eliminate a huge number of the most expensive operations, the map updates.
- Try a faster map instead of IntMap. If you know that the values of the initial List are in a fixed range, you can use an Array instead of IntMap (the index is the key; see the sketch after this list). Another possible option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, it actually makes a significant difference.
- As @ixx proposed, after incrementing a value in the map, check whether it's equal to thresh and in that case add it immediately to the result list. This will save you one linear traversal (this appears not to be that significant for large inputs).
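The Array-based counting could be sketched roughly like this, assuming the values are known to lie in 0 until maxValue (a hypothetical bound, not given in the question):
def itemsAtLeast(xs: List[Int], thresh: Int, maxValue: Int): Set[Int] = {
  val counts = new Array[Int](maxValue) // index is the key; assumes 0 <= x < maxValue
  val res = Set.newBuilder[Int]
  xs.foreach { x =>
    counts(x) += 1
    if (counts(x) == thresh) res += x // added exactly once, when the count reaches thresh
  }
  res.result()
}
For the example above, itemsAtLeast(myList, 3, 5) gives Set(1, 2).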
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created a microbenchmark to measure the actual performance of the different implementations. For sufficiently large input and output, Ixx's suggestion about immediately adding elements to the result list doesn't produce a significant improvement. However, a similar approach can be used to eliminate unnecessary Map updates (which appear to be the most expensive operation).
Results of the benchmarks (avg run times on 1,000,000 elems with pre-warming):
Author's solution: 447 ms
Ixx's solution: 412 ms
Ixx's solution 2 (eliminated excessive map writes): 150 ms
My solution: 57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= thresh) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
You could use foldLeft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
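For illustration, here is a minimal, toy count-min sketch in plain Scala (made-up names, not the Algebird API). Counts are only ever over-estimated, so filtering on the estimate may produce false positives but never drops an Int that really reaches thresh:
import scala.util.hashing.MurmurHash3

class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width) // one row of counters per hash function
  private def bucket(x: Int, row: Int): Int = {
    val h = MurmurHash3.productHash((x, row)) // cheap way to derive `depth` different hashes
    ((h % width) + width) % width
  }
  def add(x: Int): Unit =
    for (row <- 0 until depth) table(row)(bucket(x, row)) += 1
  def estimate(x: Int): Long =
    (0 until depth).map(row => table(row)(bucket(x, row))).min
}

val cms = new CountMinSketch(4096, 4)
myList.foreach(cms.add)
val approx = myList.toSet.filter(cms.estimate(_) >= thresh) // superset of the exact answer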
You can change your foldLeft example a bit, using a mutable.Set that is built incrementally and at the same time used as a filter for iterating over your Seq, by using withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilter filters the Seq in the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach.
Author's solution: 377
Ixx's solution: 399
Ixx's solution 2 (eliminated excessive map writes): 110
Sascha Kolberg's solution: 72
Aivean's solution: 54

Scala: Best way to filter & map in one iteration

I'm new to Scala and trying to figure out the best way to filter & map a collection. Here's a toy example to explain my problem.
Approach 1: This is pretty bad since I'm iterating through the list twice and calculating the same value in each iteration.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums filter { x: Int => (x * x) > N } map { x: Int => (x * x).toString }
Approach 2: This is slightly better but I still need to calculate (x * x) twice.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums collect { case x: Int if (x * x) > N => (x * x).toString }
So, is it possible to calculate this without iterating through the collection twice and avoid repeating the same calculations?
Could use a foldRight
nums.foldRight(List.empty[Int]) {
case (i, is) =>
val s = i * i
if (s > N) s :: is else is
}
A foldLeft would also achieve a similar goal, but the resulting list would be in reverse order (due to the associativity of foldLeft).
Alternatively if you'd like to play with Scalaz
import scalaz.std.list._
import scalaz.syntax.foldable._
nums.foldMap { i =>
val s = i * i
if (s > N) List(s) else List()
}
The typical approach is to use an iterator (if possible) or view (if iterator won't work). This doesn't exactly avoid two traversals, but it does avoid creation of a full-sized intermediate collection. You then map first and filter afterwards and then map again if needed:
xs.iterator.map(x => x*x).filter(_ > N).map(_.toString)
The advantage of this approach is that it's really easy to read and, since there are no intermediate collections, it's reasonably efficient.
If you are asking because this is a performance bottleneck, then the answer is usually to write a tail-recursive function or use the old-style while loop method. For instance, in your case
def sumSqBigN(xs: Array[Int], N: Int): Array[String] = {
val ysb = Array.newBuilder[String]
def inner(start: Int): Array[String] = {
if (start >= xs.length) ysb.result
else {
val sq = xs(start) * xs(start)
if (sq > N) ysb += sq.toString
inner(start + 1)
}
}
inner(0)
}
You can also pass a parameter forward in inner instead of using an external builder (especially useful for sums).
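For example, a sum-of-squares variant that threads the accumulator through inner instead of using a builder might look like this (a sketch, not from the answer; sumSqOverN is a made-up name):
def sumSqOverN(xs: Array[Int], N: Int): Long = {
  def inner(i: Int, acc: Long): Long =
    if (i >= xs.length) acc
    else {
      val sq = xs(i).toLong * xs(i)
      inner(i + 1, if (sq > N) acc + sq else acc) // carry the running sum forward
    }
  inner(0, 0L)
}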
I have yet to confirm that this is truly a single pass, but:
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
if (square > N) Some(x) else None
}
You can use collect, which applies a partial function to every value of the collection at which it is defined. Your example could be rewritten as follows:
val sqNumsLargerThanN = nums collect {
case (x: Int) if (x * x) > N => (x * x).toString
}
A very simple approach that only does the multiplication operation once. It's also lazy, so it will be executing code only when needed.
nums.view.map(x => x*x).withFilter(x => x > N).map(_.toString)
Take a look here for differences between filter and withFilter.
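In short (an illustrative example, not taken from the linked discussion): filter eagerly builds a new collection, while withFilter only remembers the predicate and applies it lazily inside the following map/flatMap/foreach.
val strict = List(1, 2, 3, 4).filter(_ % 2 == 0) // List(2, 4) is built right away
val pending = List(1, 2, 3, 4).withFilter(_ % 2 == 0) // no new collection yet
val doubled = pending.map(_ * 2) // predicate and mapping applied in a single pass: List(4, 8)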
Consider this for comprehension,
for (x <- 0 until 10; v = x*x if v > N) yield v.toString
which desugars to a map over the range (pairing each x with its once-only calculated square), a lazy withFilter on that square, and a final map yielding the filtered results. Note that only one calculation of the square per element is required (in addition to creating the range).
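Roughly, that comprehension desugars into something like the following (a sketch of the translation, not actual compiler output):
val sqNumsLargerThanN =
  (0 until 10)
    .map { x => val v = x * x; (x, v) } // the square is computed once and carried along
    .withFilter { case (_, v) => v > N } // lazy filter on the square
    .map { case (_, v) => v.toString } // yield the final result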
You can use flatMap.
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
if (square > N) Some(square.toString) else None
}
Or with Scalaz,
import scalaz.Scalaz._
val sqNumsLargerThanN = nums flatMap { x =>
val square = x * x
(square > N).option(square.toString)
}
This solves the asked question of how to do this with one iteration. This can be useful when streaming data, like with an Iterator.
However...if you are instead wanting the absolute fastest implementation, this is not it. In fact, I suspect you would use a mutable ArrayList and a while loop. But only after profiling would you know for sure. In any case, that's for another question.
Using a for comprehension would work:
val sqNumsLargerThanN = for {x <- nums if x*x > N } yield (x*x).toString
Also, I'm not sure but I think the scala compiler is smart about a filter before a map and will only do 1 pass if possible.
I am also a beginner; I did it as follows:
for (y <- nums.map(x => x*x) if y > N) { println(y) }

Scala, Eratosthenes: Is there a straightforward way to replace a stream with an iteration?

I wrote a function that generates primes indefinitely (Wikipedia: incremental sieve of Eratosthenes) using streams. It returns a stream, but it also merges streams of prime multiples internally to mark upcoming composites. The definition is concise, functional, elegant and easy to understand, if I do say so myself:
def primes(): Stream[Int] = {
def merge(a: Stream[Int], b: Stream[Int]): Stream[Int] = {
def next = a.head min b.head
Stream.cons(next, merge(if (a.head == next) a.tail else a,
if (b.head == next) b.tail else b))
}
def test(n: Int, compositeStream: Stream[Int]): Stream[Int] = {
if (n == compositeStream.head) test(n+1, compositeStream.tail)
else Stream.cons(n, test(n+1, merge(compositeStream, Stream.from(n*n, n))))
}
test(2, Stream.from(4, 2))
}
But, I get a "java.lang.OutOfMemoryError: GC overhead limit exceeded" when I try to generate the 1000th prime.
I have an alternative solution that returns an iterator over primes and uses a priority queue of tuples (multiple, prime used to generate multiple) internally to mark upcoming composites. It works well, but it takes about twice as much code, and I basically had to restart from scratch:
import scala.collection.mutable.PriorityQueue
def primes(): Iterator[Int] = {
// Tuple (composite, prime) is used to generate a prime's multiples
object CompositeGeneratorOrdering extends Ordering[(Long, Int)] {
def compare(a: (Long, Int), b: (Long, Int)) = b._1 compare a._1
}
var n = 2;
val composites = PriorityQueue(((n*n).toLong, n))(CompositeGeneratorOrdering)
def advance = {
while (n == composites.head._1) { // n is composite
while (n == composites.head._1) { // duplicate composites
val (multiple, prime) = composites.dequeue
composites.enqueue((multiple + prime, prime))
}
n += 1
}
assert(n < composites.head._1)
val prime = n
n += 1
composites.enqueue((prime.toLong * prime.toLong, prime))
prime
}
Iterator.continually(advance)
}
Is there a straightforward way to translate the code with streams to code with iterators? Or is there a simple way to make my first attempt more memory efficient?
It's easier to think in terms of streams; I'd rather start that way, then tweak my code if necessary.
I guess it's a bug in current Stream implementation.
primes().drop(999).head works fine:
primes().drop(999).head
// Int = 7919
You'll get OutOfMemoryError with stored Stream like this:
val prs = primes()
prs.drop(999).head
// Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
The problem here is with the Cons class implementation: it contains not only the calculated tail, but also a function to calculate that tail, even when the tail has already been calculated and the function is not needed any more!
In this case the functions are extremely heavy, so you'll get an OutOfMemoryError even with only 1000 of them stored.
We have to drop those functions somehow.
The intuitive fix fails:
val prs = primes().iterator.toStream
prs.drop(999).head
// Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
With iterator on a Stream you'll get a StreamIterator; with StreamIterator#toStream you'll get the initial heavy Stream back.
Workaround
So we have to convert it manually:
def toNewStream[T](i: Iterator[T]): Stream[T] =
if (i.hasNext) Stream.cons(i.next, toNewStream(i))
else Stream.empty
val prs = toNewStream(primes().iterator)
// Stream[Int] = Stream(2, ?)
prs.drop(999).head
// Int = 7919
In your first code, you should postpone the merging until the square of a prime is seen amongst the candidates. This will drastically reduce the number of streams in use, radically improving your memory usage issues. To get the 1000th prime, 7919, we only need to consider primes not above its square root, 88. That's just 23 primes/streams of their multiples, instead of 999 (22, if we ignore the evens from the outset). For the 10,000th prime, it's the difference between having 9999 streams of multiples and just 66. And for the 100,000th, only 189 are needed.
The trick is to separate the primes being consumed from the primes being produced, via a recursive invocation:
def primes(): Stream[Int] = {
def merge(a: Stream[Int], b: Stream[Int]): Stream[Int] = {
def next = a.head min b.head
Stream.cons(next, merge(if (a.head == next) a.tail else a,
if (b.head == next) b.tail else b))
}
def test(n: Int, q: Int,
compositeStream: Stream[Int],
primesStream: Stream[Int]): Stream[Int] = {
if (n == q) test(n+2, primesStream.tail.head*primesStream.tail.head,
merge(compositeStream,
Stream.from(q, 2*primesStream.head).tail),
primesStream.tail)
else if (n == compositeStream.head) test(n+2, q, compositeStream.tail,
primesStream)
else Stream.cons(n, test(n+2, q, compositeStream, primesStream))
}
Stream.cons(2, Stream.cons(3, Stream.cons(5,
test(7, 25, Stream.from(9, 6), primes().tail.tail))))
}
As an added bonus, there's no need to store the squares of primes as Longs. This will also be much faster and have better algorithmic complexity (time and space) as it avoids doing a lot of superfluous work. Ideone testing shows it runs at about ~ n^1.5..1.6 empirical orders of growth in producing up to n = 80,000 primes.
There's still an algorithmic problem here: the structure that is created here is still a linear left-leaning structure (((mults_of_2 + mults_of_3) + mults_of_5) + ...), with more frequently-producing streams situated deeper inside it (so the numbers have more levels to percolate through, going up). The right-leaning structure should be better, mults_of_2 + (mults_of_3 + (mults_of_5 + ...)). Making it a tree should bring a real improvement in time complexity (pushing it down typically to about ~ n^1.2..1.25). For a related discussion, see this haskellwiki page.
The "real" imperative sieve of Eratosthenes usually runs at around ~ n1.1 (in n primes produced) and an optimal trial division sieve at ~ n1.40..1.45. Your original code runs at about cubic time, or worse. Using imperative mutable array is usually the fastest, working by segments (a.k.a. the segmented sieve of Eratosthenes).
In the context of your second code, this is how it is achieved in Python.
Is there a straightforward way to translate the code with streams to code with iterators? Or is there a simple way to make my first attempt more memory efficient?
@Will Ness has given you an improved answer using Streams and explained why your code is taking so much memory and time (adding streams early, and a left-leaning linear structure), but no one has completely answered the second (or perhaps main) part of your question as to whether a true incremental Sieve of Eratosthenes can be implemented with Iterators.
First, we should properly credit this right-leaning algorithm, of which your first code is a crude (left-leaning) example (since it prematurely adds all prime composite streams to the merge operations); it is due to Richard Bird, as given in the Epilogue of Melissa E. O'Neill's definitive paper on incremental Sieves of Eratosthenes.
Second, no, it isn't really possible to substitute Iterators for Streams in this algorithm, as it depends on moving through a stream without restarting it, and although one can access the head of an iterator (the current position), using the next value (skipping over the head) to generate the rest of the iteration as a stream requires building a completely new iterator at a terrible cost in memory and time. However, we can use an Iterator to output the results of the sequence of primes in order to minimize memory use and make it easy to use iterator higher-order functions, as you will see in my code below.
Now, Will Ness has walked you through the principles of postponing adding prime composite streams to the calculations until they are needed, which works well when one stores these in a structure such as a Priority Queue or a HashMap, and was even missed in the O'Neill paper; for the Richard Bird algorithm, however, this is not necessary, as future stream values will not be accessed until needed, and so are not stored, provided the Streams are being properly lazily built. In fact, this algorithm doesn't even need the memoization and overhead of a full Stream, as each composite-number culling sequence only moves forward without reference to any past primes; one only needs a separate source of the base primes, which can be supplied by a recursive call of the same algorithm.
For ready reference, let's list the Haskell code of the Richard Bird algorithm as follows:
primes = 2 : ([3..] `minus` composites)
  where
    composites = union [multiples p | p <- primes]
    multiples n = map (n*) [n..]
    (x:xs) `minus` (y:ys)
      | x < y  = x : (xs `minus` (y:ys))
      | x == y = xs `minus` ys
      | x > y  = (x:xs) `minus` ys
    union = foldr merge []
      where
        merge (x:xs) ys = x : merge' xs ys
        merge' (x:xs) (y:ys)
          | x < y  = x : merge' xs (y:ys)
          | x == y = x : merge' xs ys
          | x > y  = y : merge' (x:xs) ys
In the following code I have simplified the 'minus' function (called "minusStrtAt" here), as we don't need to build a completely new stream but can incorporate the composite-subtraction operation into the generation of the original (in my case odds-only) sequence. I have also simplified the "union" function (renaming it "mrgMltpls").
The stream operations are implemented as a non-memoizing generic Co-Inductive Stream (CIS) class, where the first field of the class is the value at the current position of the stream and the second is a thunk (a zero-argument function that returns the next value of the stream through embedded closure arguments to another function).
def primes(): Iterator[Long] = {
// generic class as a Co Inductive Stream element
class CIS[A](val v: A, val cont: () => CIS[A])
def mltpls(p: Long): CIS[Long] = {
val px2 = p * 2
def nxtmltpl(cmpst: Long): CIS[Long] =
new CIS(cmpst, () => nxtmltpl(cmpst + px2))
nxtmltpl(p * p)
}
def allMltpls(mps: CIS[Long]): CIS[CIS[Long]] =
new CIS(mltpls(mps.v), () => allMltpls(mps.cont()))
def merge(a: CIS[Long], b: CIS[Long]): CIS[Long] =
if (a.v < b.v) new CIS(a.v, () => merge(a.cont(), b))
else if (a.v > b.v) new CIS(b.v, () => merge(a, b.cont()))
else new CIS(b.v, () => merge(a.cont(), b.cont()))
def mrgMltpls(mlps: CIS[CIS[Long]]): CIS[Long] =
new CIS(mlps.v.v, () => merge(mlps.v.cont(), mrgMltpls(mlps.cont())))
def minusStrtAt(n: Long, cmpsts: CIS[Long]): CIS[Long] =
if (n < cmpsts.v) new CIS(n, () => minusStrtAt(n + 2, cmpsts))
else minusStrtAt(n + 2, cmpsts.cont())
// the following are recursive, where cmpsts uses oddPrms and
// oddPrms uses a delayed version of cmpsts in order to avoid a race
// as oddPrms will already have a first value when cmpsts is called to generate the second
def cmpsts(): CIS[Long] = mrgMltpls(allMltpls(oddPrms()))
def oddPrms(): CIS[Long] = new CIS(3, () => minusStrtAt(5L, cmpsts()))
Iterator.iterate(new CIS(2L, () => oddPrms()))
{(cis: CIS[Long]) => cis.cont()}
.map {(cis: CIS[Long]) => cis.v}
}
The above code generates the 100,000th prime (1299709) on ideone in about 1.3 seconds with about a 0.36 second overhead and has an empirical computational complexity to 600,000 primes of about 1.43. The memory use is negligible above that used by the program code.
The above code could be implemented using the built-in Scala Streams, but there is a performance and memory use overhead (of a constant factor) that this algorithm does not require. Using Streams would mean that one could use them directly without the extra Iterator generation code, but as this is used only for final output of the sequence, it doesn't cost much.
To implement some basic tree folding as Will Ness has suggested, one only needs to add a "pairs" function and hook it into the "mrgMltpls" function:
def primes(): Iterator[Long] = {
// generic class as a Co Inductive Stream element
class CIS[A](val v: A, val cont: () => CIS[A])
def mltpls(p: Long): CIS[Long] = {
val px2 = p * 2
def nxtmltpl(cmpst: Long): CIS[Long] =
new CIS(cmpst, () => nxtmltpl(cmpst + px2))
nxtmltpl(p * p)
}
def allMltpls(mps: CIS[Long]): CIS[CIS[Long]] =
new CIS(mltpls(mps.v), () => allMltpls(mps.cont()))
def merge(a: CIS[Long], b: CIS[Long]): CIS[Long] =
if (a.v < b.v) new CIS(a.v, () => merge(a.cont(), b))
else if (a.v > b.v) new CIS(b.v, () => merge(a, b.cont()))
else new CIS(b.v, () => merge(a.cont(), b.cont()))
def pairs(mltplss: CIS[CIS[Long]]): CIS[CIS[Long]] = {
val tl = mltplss.cont()
new CIS(merge(mltplss.v, tl.v), () => pairs(tl.cont()))
}
def mrgMltpls(mlps: CIS[CIS[Long]]): CIS[Long] =
new CIS(mlps.v.v, () => merge(mlps.v.cont(), mrgMltpls(pairs(mlps.cont()))))
def minusStrtAt(n: Long, cmpsts: CIS[Long]): CIS[Long] =
if (n < cmpsts.v) new CIS(n, () => minusStrtAt(n + 2, cmpsts))
else minusStrtAt(n + 2, cmpsts.cont())
// the following are recursive, where cmpsts uses oddPrms and
// oddPrms uses a delayed version of cmpsts in order to avoid a race
// as oddPrms will already have a first value when cmpsts is called to generate the second
def cmpsts(): CIS[Long] = mrgMltpls(allMltpls(oddPrms()))
def oddPrms(): CIS[Long] = new CIS(3, () => minusStrtAt(5L, cmpsts()))
Iterator.iterate(new CIS(2L, () => oddPrms()))
{(cis: CIS[Long]) => cis.cont()}
.map {(cis: CIS[Long]) => cis.v}
}
The above code generates the 100,000th prime (1299709) on ideone in about 0.75 seconds with about a 0.37 second overhead and has an empirical computational complexity to the 1,000,000th prime (15485863) of about 1.09 (5.13 seconds). The memory use is negligible above that used by the program code.
Note that the above codes are completely functional in that there is no mutable state used whatsoever, but that the Bird algorithm (or even the tree folding version) isn't as fast as using a Priority Queue or HashMap for larger ranges as the number of operations to handle the tree merging has a higher computational complexity than the log n overhead of the Priority Queue or the linear (amortized) performance of a HashMap (although there is a large constant factor overhead to handle the hashing so that advantage isn't really seen until some truly large ranges are used).
The reason that these codes use so little memory is that the CIS streams are formulated with no permanent reference to the start of the streams so that the streams are garbage collected as they are used, leaving only the minimal number of base prime composite sequence place holders, which as Will Ness has explained is very small - only 546 base prime composite number streams for generating the first million primes up to 15485863, each placeholder only taking a few 10's of bytes (eight for the Long number, eight for the 64-bit function reference, with another couple of eight bytes for the pointer to the closure arguments and another few bytes for function and class overheads, for a total per stream placeholder of perhaps 40 bytes, or a total of not much more than 20 Kilobytes for generating the sequence for a million primes).
If you just want an infinite stream of primes, this is the most elegant way in my opinion:
def primes = {
def sieve(from : Stream[Int]): Stream[Int] = from.head #:: sieve(from.tail.filter(_ % from.head != 0))
sieve(Stream.from(2))
}

Generate a DAG from a poset using strictly functional programming

Here is my problem: I have a sequence S of (nonempty but possibly not distinct) sets s_i, and for each s_i need to know how many sets s_j in S (i ≠ j) are subsets of s_i.
I also need incremental performance: once I have all my counts, I may replace one set s_i by some subset of s_i and update the counts incrementally.
Performing all this using purely functional code would be a huge plus (I code in Scala).
As set inclusion is a partial ordering, I thought the best way to solve my problem would be to build a DAG representing the Hasse diagram of the sets, with edges representing inclusion, and to attach to each node an integer value representing the size of the sub-DAG below the node plus 1. However, I have been stuck for several days trying to develop the algorithm that builds the Hasse diagram from the partial ordering (let's not talk about incrementality!), even though I thought it would be some standard undergraduate material.
Here is my data structure :
case class HNode[A] (
val v: A,
val child: List[HNode[A]]) {
val rank = 1 + child.map(_.rank).sum
}
My DAG is defined by a list of roots and some partial ordering:
class Hasse[A](val po: PartialOrdering[A], val roots: List[HNode[A]]) {
def +(v: A): Hasse[A] = new Hasse[A](po, add(v, roots))
private def collect(v: A, roots: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
if (roots == Nil) collected
else {
val (subsets, remaining) = roots.partition(r => po.lteq(r.v, v))
collect(v, remaining.map(_.child).flatten, subsets.filter(r => !collected.exists(c => po.lteq(r.v, c.v))) ::: collected)
}
}
I am pretty stuck here. The latest approach I came up with for adding a new value v to the DAG is:
1. Find all "root subsets" rs_i of v in the DAG, i.e., subsets of v such that no superset of rs_i is a subset of v. This can be done quite easily by performing a search (BFS or DFS) on the graph (the collect function, possibly non-optimal or even flawed).
2. Build the new node n_v, whose children are the previously found rs_i.
3. Now, let's find out where n_v should be attached: for a given list of roots, find the supersets of v. If none are found, add n_v to the roots and remove subsets of n_v from the roots. Otherwise, perform step 3 recursively on the supersets' children.
I have not yet fully implemented this algorithm, but it seems unnecessarily convoluted and non-optimal for my apparently simple problem. Is there some simpler algorithm available (Google was clueless on this)?
After some work, I finally ended up solving my problem, following my initial intuition. The collect method and rank evaluation were flawed; I rewrote them, with tail recursion as a bonus. Here is the code I obtained:
import scala.annotation.tailrec

final case class HNode[A](
val v: A,
val child: List[HNode[A]]) {
val rank: Int = 1 + count(child, Set.empty)
@tailrec
private def count(stack: List[HNode[A]], c: Set[HNode[A]]): Int =
if (stack == Nil) c.size
else {
val head :: rem = stack
if (c(head)) count(rem, c)
else count(head.child ::: rem, c + head)
}
}
// ...
private def add(v: A, roots: List[HNode[A]]): List[HNode[A]] = {
val newNode = HNode(v, collect(v, roots, Nil))
attach(newNode, roots)
}
private def attach(n: HNode[A], roots: List[HNode[A]]): List[HNode[A]] =
if (roots.contains(n)) roots
else {
val (supersets, remaining) = roots.partition { r =>
// Strict superset to avoid creating cycles in case of equal elements
po.tryCompare(n.v, r.v) == Some(-1)
}
if (supersets.isEmpty) n :: remaining.filter(r => !po.lteq(r.v, n.v))
else {
supersets.map(s => HNode(s.v, attach(n, s.child))) ::: remaining
}
}
@tailrec
private def collect(v: A, stack: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
if (stack == Nil) collected
else {
val head :: tail = stack
if (collected.exists(c => po.lteq(head.v, c.v))) collect(v, tail, collected)
else if (po.lteq(head.v, v)) collect(v, tail, head :: (collected.filter(c => !po.lteq(c.v, head.v))))
else collect(v, head.child ::: tail, collected)
}
I now must check some optimizations:
- cut off branches with totally distinct sets when collecting subsets (as Rex Kerr suggested)
- see if sorting the sets by size improves the process (as mitchus suggested)
The next problem is to work out the (worst-case) complexity of the add() operation.
With n the number of sets and d the size of the largest set, the complexity will probably be O(n²d), but I hope it can be refined. Here is my reasoning: if all sets are distinct, the DAG reduces to a sequence of roots/leaves. Thus, every time I try to add a node to the data structure, I still have to check for inclusion against each node already present (both in the collect and attach procedures). This leads to 1 + 2 + … + n = n(n+1)/2 ∈ O(n²) inclusion checks.
Each set-inclusion test is O(d), hence the result.
Suppose your DAG G contains a node v for each set, with attributes v.s (the set) and v.count (the number of instances of the set), including a node G.root with G.root.s = union of all sets (where G.root.count=0 if this set never occurs in your collection).
Then to count the number of distinct subsets of s you could do the following (in a bastardized mixture of Scala, Python and pseudo-code):
sum(apply(lambda x: x.count, get_subsets(s, G.root)))
where
get_subsets(s, v) :
if(v.s is not a subset of s, {},
union({v} :: apply(v.children, lambda x: get_subsets(s, x))))
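A rough Scala rendering of the above, assuming a hypothetical node type carrying the set, its occurrence count, and its children (the names are made up, not from the answer):
case class CountedNode(s: Set[Int], count: Int, children: List[CountedNode])

def getSubsets(s: Set[Int], v: CountedNode): Set[CountedNode] =
  if (!v.s.subsetOf(s)) Set.empty
  else v.children.flatMap(c => getSubsets(s, c)).toSet + v // v itself plus the subsets found below it

def countSubsets(s: Set[Int], root: CountedNode): Int =
  getSubsets(s, root).iterator.map(_.count).sum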
In my opinion though, for performance reasons you would be better off abandoning this kind of purely functional solution... it works well on lists and trees, but beyond that the going gets tough.

Testing whether an ordered infinite stream contains a value

I have an infinite Stream of primes, primeStream (starting at 2 and increasing). I also have another stream of Ints, s, whose elements increase in magnitude, and I want to test whether each of them is prime.
What is an efficient way to do this? I could define
def isPrime(n: Int) = n == primeStream.dropWhile(_ < n).head
but this seems inefficient since it needs to iterate over the whole stream each time.
Implementation of primeStream (shamelessly copied from elsewhere):
val primeStream: Stream[Int] =
2 #:: primeStream.map{ i =>
Stream.from(i + 1)
.find{ j =>
primeStream.takeWhile{ k => k * k <= j }
.forall{ j % _ > 0 }
}.get
}
If the question is about implementing isPrime, then you should do as suggested by rossum: even with division costing more than an equality test, and with primes being denser for lower values of n, it would be asymptotically much faster. Moreover, it is very fast when testing non-primes that have a small divisor (which most numbers have).
It may be different if you want to test the primality of elements of another increasing Stream. You may consider something akin to a merge sort. You did not state how you want to get your result; here it is given as a stream of Booleans, but it should not be too hard to adapt to something else.
/**
* Returns a stream of boolean, whether element at the corresponding position
* in xs belongs in ys. xs and ys must both be increasing streams.
*/
def belong[A: Ordering](xs: Stream[A], ys: Stream[A]): Stream[Boolean] = {
if (xs.isEmpty) Stream.empty
else if (ys.isEmpty) xs.map(_ => true)
else Ordering[A].compare(xs.head, ys.head) match {
case less if less < 0 => false #:: belong(xs.tail, ys)
case equal if equal == 0 => true #:: belong(xs.tail, ys.tail)
case greater if greater > 0 => belong(xs, ys.tail)
}
}
So you may do belong(yourStream, primeStream)
Yet it is not obvious that this solution will be better than testing each number separately for primality, stopping at the square root; especially if yourStream increases fast compared to the primes, so that you have to compute many primes in vain just to keep up. And even less so if there is no reason to suspect that the elements of yourStream tend to be prime or to have only large divisors.
You only need to read your prime stream as far as sqrt(s).
As you retrieve each p from the prime stream check if p evenly divides s.
This will give you a trial division method of prime checking.
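Using the primeStream from the question, that trial-division test could be sketched like this (a sketch; it stops at sqrt(n)):
def isPrime(n: Int): Boolean =
  n > 1 && primeStream.takeWhile(p => p.toLong * p <= n).forall(n % _ != 0)
For example, isPrime(7919) is true and isPrime(7917) is false.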
To solve the general question of determining whether an ordered finite list consisted entirely of element of an ordered but infinite stream:
The simplest way is
candidate.toSet subsetOf infiniteList.takeWhile( _ <= candidate.last).toSet
but if the candidate is large, that requires a lot of space, and it is O(n log n) instead of the O(n) it could be. The O(n) way is
def acontains(a : Int, b : Iterator[Int]) : Boolean = {
while (b hasNext) {
val c = b.next
if (c == a) {
return true
}
if (c > a) {
return false
}
}
return false
}
def scontains(candidate: List[Int], infiniteList: Stream[Int]) : Boolean = {
val it = candidate.iterator
val il = infiniteList.iterator
while (it.hasNext) {
if (!acontains(it.next, il)) {
return false
}
}
return true
}
(Incidentally, if some helpful soul could propose a more Scalicious way to write the foregoing, I'd appreciate it.)
EDIT:
In the comments, the inestimable Luigi Plinge pointed out that I could just write:
def scontains(candidate: List[Int], infiniteStream: Stream[Int]) = {
val it = infiniteStream.iterator
candidate.forall(i => it.dropWhile(_ < i).next == i)
}