Efficient way to select a subset of elements from a Seq in Scala - scala

I have a sequence
val input = Seq(1,3,4,5,9,11...)
I want to randomly select a subset of it. What is the fastest way.
I currently implement it like this:
//ratio is the percentage of the subgroup from the whole group
def randomSelect(ratio:Double): Boolean = {
val rr=scala.util.Random
if (rr.nextFloat() < ratio) true else false
}
val ratio = 0.3
val result = input.map(x=>(x, randomSelect(ratio))).filter(x._2).map(x=>x._1)
So I first attach a true/false label for each element, and filter out those false elements, and get back the subset of the sequence.
Is there any faster/ advantage way?

So there are basically two approaches to this:
select n elements at random
include or exclude each element with probability p
Your solution is the latter and can be simplified to:
l.filter(_ => r.nextFloat < p)
(I'm calling the list, l, the instance of Random r and your ratio p from here on out.)
If you wanted to sample exactly n elements you could do:
r.shuffle(l).take(n)
I compared these selecting 200 elements from a 1000 element list:
scala> val first = time{
| l.map(x => (x, r.nextFloat < p)).filter(_._2).map(_._1)
| }
Elapsed time: 3249507ns
scala> val second = time {
| r.shuffle(l).take(200)
| }
Elapsed time: 10640432ns
scala> val third = time{
| l.filter(_ => r.nextFloat < p)}
Elapsed time: 1689009ns
Dropping your extra two mapss appears to speed things up by about a third (which makes complete sense). The shuffle-and-take method is significantly slower, but does guarantee you a fixed number of elements.
I borrowed the timing function from here if you want to do a more rigorous investigation (i.e. average over many trials, rather than 1).

If your list isn't big, a simple filter as suggested by others should suffice:
list.filter(_ => Random.nextDouble < p)
In case you have a big list, the per-element call of Random could become the bottleneck. One approach to minimize the calls is to generate random gaps (0, 1, 2, ...) by which the data sampling will hop over the elements. Below is a simple implementation in Scala:
import scala.util.Random
import scala.math._
def gapSampling(list: List[Double], p: Double): List[Double] = {
def randomGap(p: Double): Double = {
val epsilon: Double = 1e-10
val u = max(Random.nextDouble, epsilon)
floor( log(u) / log(1 - p) )
}
#scala.annotation.tailrec
def samplingFcn(acc: List[Double], list: List[Double], p: Double): List[Double] = list match {
case Nil => acc
case _ =>
val gap = randomGap(p).toInt
val l = list.drop(gap + 1)
val accNew = l.headOption match {
case Some(e) => e :: acc
case None => acc
}
samplingFcn(accNew, l, p)
}
samplingFcn(List[Double](), list, p).reverse
}
val list = (1 to 100).toList.map(_.toDouble)
gapSampling(list, 0.3)
// res1: List[Double] = List(
// 2.0, 5.0, 7.0, 14.0, 15.0, 18.0, 20.0, 25.0, 26.0, 28.0, 33.0,
// 35.0, 42.0, 43.0, 47.0, 48.0, 50.0, 55.0, 56.0, 59.0, 62.0,
// 69.0, 72.0, 75.0, 76.0, 79.0, 82.0, 93.0, 96.0, 97.0, 98.0
// )
More details about such gap sampling can be found here.

Related

How can we sample from a list of strings based on their probability of occurrence in the list in Scala?

I have a List[(String, Double)] variable where the second element of tuple denotes the probability of the string in first element appearing in a corpus. An example would be [(Apple, 0.2), (Banana, 0.3), (Lemon, 0.5)] where an Apple appears with a probability of 0.2 in the list of strings. I want to randomly sample from the list of strings based on their probability of appearance something along the lines of numpy random.choice() method. What would be the correct way to do this in Scala?
Another solution:
def choice(samples: Seq[(String, Double)], n: Int): Seq[String] = {
val (strings, probs) = samples.unzip
val cumprobs = probs.scanLeft(0.0){ _ + _ }.init
def p2s(p: Double): String = strings(cumprobs.lastIndexWhere(_ <= p))
Seq.fill(n)(math.random).map(p2s)
}
An usage (and verify):
>> val ss = choice(Seq(("Apple", 0.2), ("Banana", 0.3), ("Lemon", 0.5)), 10000)
>> ss.groupBy(identity).map{ case(k, v) => (k, v.size)}
Map[String, Int] = Map(Banana -> 3013, Lemon -> 4971, Apple -> 2016)
A very naive (and inefficient) solution would be to create a List of 100 elements that repeats each of the original elements the amount of times needed to respect its probabilities. Then you can randomly shuffle that List and finally take the first element.
import scala.util.Random
final val percent_100 = BigDecimal(100)
def choice[T](data: List[(T, Double)]): T = {
val distribution = data.flatMap {
case (elem, probability) =>
val scaledProbability = BigDecimal(probability).setScale(
scale = 2,
BigDecimal.RoundingMode.HALF_EVEN
)
val n = (scaledProbability * percent_100).toIntExact
List.fill(n)(elem)
}
Random.shuffle(distribution).head
}
However, I am sure there should be better ways of solving this.

Splitting a Scala Range into evenly-sized contiguous sub-Ranges

If I have a Range, how can I split it into a sequence of contiguous sub-ranges, where the number of sub-ranges (buckets) is specified? Empty buckets should be omitted if there are not enough items.
For example:
splitRange(1 to 6, 3) == Seq(Range(1,2), Range(3,4), Range(5,6))
splitRange(1 to 2, 3) == Seq(Range(1), Range(2))
Some additional constraints, that rule out some of the solutions I've seen:
Roughly even bucket size - the bucket size should vary by 1, at most
The length of the input range may sometimes be very large, so the ranges should not be materialized into sequences (e.g. can't use grouped)
This also implies that we don't allocate numbers to buckets in round-robin fashion, because then numbers in each bucket wouldn't be contiguous and so wouldn't form a Range
Ideally, the sub-ranges would be produced in order, i.e (1,2)(3,4), not (3,4)(1,2)
A colleague found a solution here:
def splitRange(r: Range, chunks: Int): Seq[Range] = {
if (r.step != 1)
throw new IllegalArgumentException("Range must have step size equal to 1")
val nchunks = scala.math.max(chunks, 1)
val chunkSize = scala.math.max(r.length / nchunks, 1)
val starts = r.by(chunkSize).take(nchunks)
val ends = starts.map(_ - 1).drop(1) :+ r.end
starts.zip(ends).map(x => x._1 to x._2)
}
but this can produce very uneven bucket sizes when N is small, e.g:
splitRange(1 to 14, 5)
//> Vector(Range(1, 2), Range(3, 4), Range(5, 6),
//| Range(7, 8), Range(9, 10, 11, 12, 13, 14))
^^^^^^^^^^^^^^^^^^^^^
Floating-point approaches
One way is to generate a fractional (floating-point) offset for each bucket, then convert these to integer Ranges, by zipping. Empty Ranges also need filtering out using collect.
def splitRange(r: Range, chunks: Int): Seq[Range] = {
require(r.step == 1, "Range must have step size equal to 1")
require(chunks >= 1, "Must ask for at least 1 chunk")
val m = r.length.toDouble
val chunkSize = m / chunks
val bins = (0 to chunks).map { x => math.round((x.toDouble * m) / chunks).toInt }
val pairs = bins zip (bins.tail)
pairs.collect { case (a, b) if b > a => a to b }
}
(The first version of this solution had a rounding problem such that it could not handle Int.MaxValue - this has now been fixed based on Rex Kerr's recursive floating-point solution below)
Another floating-point approach is to recurse down the range, taking the head off the range each time, so we cannot miss any elements. This version can handle Int.MaxValue correctly.
def splitRange(r: Range, chunks: Int): Seq[Range] = {
require(r.step == 1, "Range must have step size equal to 1")
require(chunks >= 1, "Must ask for at least 1 chunk")
val chunkSize = r.length.toDouble / chunks
def go(i: Int, r: Range, delta: Double, acc: List[Range]): List[Range] = {
if (i == chunks) r :: acc
// ensures the last chunk has all remaining values, even if error accumulates
else {
val s = delta + chunkSize
val (chunk, rest) = r.splitAt(s.toInt)
go(i + 1, rest, s - s.toInt, if (chunk.length > 0) chunk :: acc else acc)
}
}
go(1, r, 0.0D, Nil).reverse
}
One can also recurse to generate the (start,end) pairs, rather than zipping them. This is adapted from Rex Kerr's answer to a similar question
def splitRange(r: Range, chunks: Int): Seq[Range] = {
require(r.step == 1, "Range must have step size equal to 1")
require(chunks >= 1, "Must ask for at least 1 chunk")
val m = r.length
val bins = (0 to chunks).map { x => math.round((x.toDouble * m) / chunks).toInt }
def snip(r: Range, ns: Seq[Int], got: Vector[Range]): Vector[Range] = {
if (ns.length < 2) got
else {
val (i, j) = (ns.head, ns.tail.head)
snip(r.drop(j - i), ns.tail, got :+ r.take(j - i))
}
}
snip(r, bins, Vector.empty).filter(_.length > 0)
}
Integer approach
Finally, I realized that this can be done with purely integer arithmetic by adapting Bresenham's line-drawing algorithm, which solves a basically equivalent problem - how to allocate the x-pixels evenly across the y rows, using only integer operations!
I initially translated the pseudo-code into an imperative solution using var and ArrayBuffer, then converted it into a tail-recursive solution:
def splitRange(r: Range, chunks: Int): List[Range] = {
require(r.step == 1, "Range must have step size equal to 1")
require(chunks >= 1, "Must ask for at least 1 chunk")
val dy = r.length
val dx = chunks
#tailrec
def go(y0:Int, y:Int, d:Int, ch:Int, acc: List[Range]):List[Range] = {
if (ch == 0) acc
else {
if (d > 0) go(y0, y-1, d-dx, ch, acc)
else go(y-1, y, d+dy, ch-1, if (y > y0) acc
else (y to y0) :: acc)
}
}
go(r.end, r.end, dy - dx, chunks, Nil)
}
Please see the Wikipedia link for a full explanation, but essentially the algorithm zig-zags up the slope of a line, alternatively adding the y-range dy and subtracting the x-range dx. If these don't divide exactly, then an error accumulates until it divides exactly, leading to an extra pixel in some sub-ranges.
splitRange(3 to 15, 5)
//> List(Range(3, 4), Range(5, 6, 7), Range(8, 9),
//| Range(10, 11, 12), Range(13, 14, 15))

Efficient Scala idiomatic way to pick top 85 percent of sorted values?

Given a non-increasing list of numbers, I want to pick top 85% of values in the list. Here is how I am currently doing it.
scala> val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
a: Array[Double] = Array(8.6, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
scala> val threshold = a.sum * 0.85
threshold: Double = 26.2565
scala> val successiveSums = a.tail.foldLeft(Array[Double](a.head)){ case (x,y) => x ++ Array(y + x.last) }
successiveSums: Array[Double] = Array(8.6, 15.45, 20.36, 23.81, 26.549999999999997, 28.609999999999996, 30.139999999999997, 30.49, 30.77, 30.89)
scala> successiveSums.takeWhile( x => x <= threshold )
res40: Array[Double] = Array(8.6, 15.45, 20.36, 23.81)
scala> val size = successiveSums.takeWhile( x => x <= threshold ).size
size: Int = 4
scala> a.take(size)
res41: Array[Double] = Array(8.6, 6.85, 4.91, 3.45)
I want improve its
performance
code-size
Any suggestions ?
On code size, consider this oneliner,
a.take( a.scanLeft(0.0)(_+_).takeWhile( _ <= a.sum * 0.85 ).size - 1 )
Here scanLeft accumulates additions.
On performance, tagging intermediate values may help not to recompute same operations, namely
val threshold = a.sum * 0.85
val size = a.scanLeft(0.0)(_+_).takeWhile( _ <= threshold ).size - 1
a.take( size )
There is some space for improvement in elm's answer:
1) You don't need to compute sum 2 times.
2) You can avoid creation of additional collection with takeWhile method and use indexWhere instead.
val sums = a.scanLeft(0.0)(_ + _)
a.take(sums.indexWhere(_ > sums.last * 0.85) - 1)
There's no library method that will do exactly what you want. Generally if you want something that performs well, you'd use a tail-recursive method both to find the sum and to find the point where the 85th percentile of the total sum is crossed. Something like
def threshold(
xs: Array[Double], thresh: Double,
i: Int = 0, sum: Double = 0
) {
val next = sum + x(i)
if (next > thresh) xs.take(i)
else threshold(xs, thresh, i+1, next)
}
Well in this case I would make a little use of mutable state. See the code below:
val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
def f(a: Array[Double]) = {
val toGet = a.sum * 0.85
var sum = 0.0
a.takeWhile(x => {sum += x; sum <= toGet })
}
println(f(a).deep) //Array(8.6, 6.85, 4.91, 3.45)
In my opinion it's acceptable because function f does not have any side effects

Monte Carlo calculation of Pi in Scala

Suppose I would like to calculate Pi with Monte Carlo simulation as an exercise.
I am writing a function, which picks a point in a square (0, 1), (1, 0) at random and tests if the point is inside the circle.
import scala.math._
import scala.util.Random
def circleTest() = {
val (x, y) = (Random.nextDouble, Random.nextDouble)
sqrt(x*x + y*y) <= 1
}
Then I am writing a function, which takes as arguments the test function and the number of trials and returns the fraction of the trials in which the test was found to be true.
def monteCarlo(trials: Int, test: () => Boolean) =
(1 to trials).map(_ => if (test()) 1 else 0).sum * 1.0 / trials
... and I can calculate Pi
monteCarlo(100000, circleTest) * 4
Now I wonder if monteCarlo function can be improved. How would you write monteCarlo efficient and readable ?
For example, since the number of trials is large is it worth using a view or iterator instead of Range(1, trials) and reduce instead of map and sum ?
It's worth noting that Random.nextDouble is side-effecting—when you call it it changes the state of the random number generator. This may not be a concern to you, but since there are already five answers here I figure it won't hurt anything to add one that's purely functional.
First you'll need a random number generation monad implementation. Luckily NICTA provides a really nice one that's integrated with Scalaz. You can use it like this:
import com.nicta.rng._, scalaz._, Scalaz._
val pointInUnitSquare = Rng.choosedouble(0.0, 1.0) zip Rng.choosedouble(0.0, 1.0)
val insideCircle = pointInUnitSquare.map { case (x, y) => x * x + y * y <= 1 }
def mcPi(trials: Int): Rng[Double] =
EphemeralStream.range(0, trials).foldLeftM(0) {
case (acc, _) => insideCircle.map(_.fold(1, 0) + acc)
}.map(_ / trials.toDouble * 4)
And then:
scala> val choosePi = mcPi(10000000)
choosePi: com.nicta.rng.Rng[Double] = com.nicta.rng.Rng$$anon$3#16dd554f
Nothing's been computed yet—we've just built up a computation that will generate our value randomly when executed. Let's just execute it on the spot in the IO monad for the sake of convenience:
scala> choosePi.run.unsafePerformIO
res0: Double = 3.1415628
This won't be the most performant solution, but it's good enough that it may not be a problem for many applications, and the referential transparency may be worth it.
Stream based version, for another alternative. I think this is quite clear.
def monteCarlo(trials: Int, test: () => Boolean) =
Stream
.continually(if (test()) 1.0 else 0.0)
.take(trials)
.sum / trials
(the sum isn't specialised for streams but the implementation (in TraversableOnce) just calls foldLeft that is specialised and "allows GC to collect along the way." So the .sum won't force the stream to be evaluated and so won't keep all the trials in memory at once)
I see no problem with the following recursive version:
def monteCarlo(trials: Int, test: () => Boolean) = {
def bool2double(b: Boolean) = if (b) 1.0d else 0.0d
#scala.annotation.tailrec
def recurse(n: Int, sum: Double): Double =
if (n <= 0) sum / trials
else recurse(n - 1, sum + bool2double(test()))
recurse(trials, 0.0d)
}
And a foldLeft version, too:
def monteCarloFold(trials: Int, test: () => Boolean) =
(1 to trials).foldLeft(0.0d)((s,i) => s + (if (test()) 1.0d else 0.0d)) / trials
This is more memory efficient than the map version in the question.
Using tail recursion might be an idea:
def recMonteCarlo(trials: Int, currentSum: Double, test:() => Boolean):Double = trials match {
case 0 => currentSum
case x =>
val nextSum = currentSum + (if (test()) 1.0 else 0.0)
recMonteCarlo(trials-1, nextSum, test)
def monteCarlo(trials: Int, test:() => Boolean) = {
val monteSum = recMonteCarlo(trials, 0, test)
monteSum / trials
}
Using aggregate on a parallel collection, like this,
def monteCarlo(trials: Int, test: () => Boolean) = {
val pr = (1 to trials).par
val s = pr.aggregate(0)( (a,_) => a + (if (test()) 1 else 0), _ + _)
s * 4.0 / trials
}
where partial results are summed up in parallel with other test calculations.

Find min and max elements of array

I want to find the min and max elements of an array using for comprehension. Is it possible to do that with one iteration of array to find both min element and max element?
I am looking for a solution without using scala provided array.min or max.
You can get min and max values of an Array[Int] with reduceLeft function.
scala> val a = Array(20, 12, 6, 15, 2, 9)
a: Array[Int] = Array(20, 12, 6, 15, 2, 9)
scala> a.reduceLeft(_ min _)
res: Int = 2
scala> a.reduceLeft(_ max _)
res: Int = 20
See this link for more information and examples of reduceLeft method: http://alvinalexander.com/scala/scala-reduceleft-examples
Here is a concise and readable solution, that avoids the ugly if statements :
def minMax(a: Array[Int]) : (Int, Int) = {
if (a.isEmpty) throw new java.lang.UnsupportedOperationException("array is empty")
a.foldLeft((a(0), a(0)))
{ case ((min, max), e) => (math.min(min, e), math.max(max, e))}
}
Explanation : foldLeft is a standard method in Scala on many collections. It allows to pass an accumulator to a callback function that will be called for each element of the array.
Take a look at scaladoc for further details
def findMinAndMax(array: Array[Int]) = { // a non-empty array
val initial = (array.head, array.head) // a tuple representing min-max
// foldLeft takes an initial value of type of result, in this case a tuple
// foldLeft also takes a function of 2 parameters.
// the 'left' parameter is an accumulator (foldLeft -> accum is left)
// the other parameter is a value from the collection.
// the function2 should return a value which replaces accumulator (in next iteration)
// when the next value from collection will be picked.
// so on till all values are iterated, in the end accum is returned.
array.foldLeft(initial) { ((min, max), x) =>
if (x < min) (x, max)
else if (x > max) (min, x)
else acc
}
}
Following on from the other answers - a more general solution is possible, that works for other collections as well as Array, and other contents as well as Int:
def minmax[B >: A, A](xs: Iterable[A])(implicit cmp: Ordering[B]): (A, A) = {
if (xs.isEmpty) throw new UnsupportedOperationException("empty.minmax")
val initial = (xs.head, xs.head)
xs.foldLeft(initial) { case ((min, max), x) =>
(if (cmp.lt(x, min)) x else min, if (cmp.gt(x, max)) x else max) }
}
For example:
minmax(List(4, 3, 1, 2, 5)) //> res0: (Int, Int) = (1,5)
minmax(Vector('Z', 'K', 'B', 'A')) //> res1: (Char, Char) = (A,Z)
minmax(Array(3.0, 2.0, 1.0)) //> res2: (Double, Double) = (1.0,3.0)
(It's also possible to write this a bit more concisely using cmp.min() and cmp.max(), but only if you remove the B >: A type bound, which makes the function less general).
Consider this (for non-empty orderable arrays),
val ys = xs.sorted
val (min,max) = (ys.head, ys.last)
val xs: Array[Int] = ???
var min: Int = Int.MaxValue
var max: Int = Int.MinValue
for (x <- xs) {
if (x < min) min = x
if (x > max) max = x
}
I'm super late to the party on this one, but I'm new to Scala and thought I'd contribute anyway. Here is a solution using tail recursion:
#tailrec
def max(list: List[Int], currentMax: Int = Int.MinValue): Int = {
if(list.isEmpty) currentMax
else if ( list.head > currentMax) max(list.tail, list.head)
else max(list.tail,currentMax)
}
Of all of the answers I reviewed to this questions, DNA's solution was the closest to "Scala idiomatic" I could find. However, it can be slightly improved by...:
Performing as few comparisons as needed (important for very large collections)
Provide ideal ordering consistency by only using the Ordering.lt method
Avoiding throwing an Exception
Making the code more readable for those new to and learning Scala
The comments should help clarify the changes.
def minAndMax[B>: A, A](iterable: Iterable[A])(implicit ordering: Ordering[B]): Option[(A, A)] =
if (iterable.nonEmpty)
Some(
iterable.foldLeft((iterable.head, iterable.head)) {
case (minAndMaxTuple, element) =>
val (min, max) =
minAndMaxTuple //decode reference to tuple
if (ordering.lt(element, min))
(element, max) //if replacing min, it isn't possible max will change so no need for the max comparison
else
if (ordering.lt(max, element))
(min, element)
else
minAndMaxTuple //use original reference to avoid instantiating a new tuple
}
)
else
None
And here's the solution expanded to return the lower and upper bounds of a 2d space in a single pass, again using the above optimizations:
def minAndMax2d[B >: A, A](iterable: Iterable[(A, A)])(implicit ordering: Ordering[B]): Option[((A, A), (A, A))] =
if (iterable.nonEmpty)
Some(
iterable.foldLeft(((iterable.head._1, iterable.head._1), (iterable.head._2, iterable.head._2))) {
case ((minAndMaxTupleX, minAndMaxTupleY), (elementX, elementY)) =>
val ((minX, maxX), (minY, maxY)) =
(minAndMaxTupleX, minAndMaxTupleY) //decode reference to tuple
(
if (ordering.lt(elementX, minX))
(elementX, maxX) //if replacing minX, it isn't possible maxX will change so no need for the maxX comparison
else
if (ordering.lt(maxX, elementX))
(minX, elementX)
else
minAndMaxTupleX //use original reference to avoid instantiating a new tuple
, if (ordering.lt(elementY, minY))
(elementY, maxY) //if replacing minY, it isn't possible maxY will change so no need for the maxY comparison
else
if (ordering.lt(maxY, elementY))
(minY, elementY)
else
minAndMaxTupleY //use original reference to avoid instantiating a new tuple
)
}
)
else
None
You could always write your own foldLeft function - that will guarantee one iteration and known performance.
val array = Array(3,4,62,8,9,2,1)
if(array.isEmpty) throw new IllegalArgumentException // Just so we can safely call array.head
val (minimum, maximum) = array.foldLeft((array.head, array.head)) { // We start of with the first element as min and max
case ((min, max), next) =>
if(next > max) (min, next)
else if(next < min) (next, max)
else (min, max)
}
println(minimum, maximum) //1, 62
scala> val v = Vector(1,2)
scala> v.max
res0: Int = 2
scala> v.min
res1: Int = 2
You could use the min and max methods of Vector