Count how many times numbers from a list occur in a list of tupled intervals in Scala - scala

Say I have a list of tuples:
val ranges= List((1,4), (5,8), (9,10))
and a list of numbers
val nums = List(2,2,3,7,8,9)
I want to make a map from each tuple in ranges to the number of values in nums that fall into that tuple's interval.
Output:
Map ((1,4) -> 3, (5,8) -> 2, (9,10) -> 1)
What is the best way to go about it in Scala?
I have been trying to use for loops and keeping a counter but am falling short.

Something like this:
val ranges = List((1, 4), (5, 8), (9, 10))
val nums = List(2, 2, 3, 7, 8, 9)
val occurrences = ranges.map { case (l, r) => nums.count((l to r) contains _) }
val map = (ranges zip occurrences).toMap
println(map) // Map((1,4) -> 3, (5,8) -> 2, (9,10) -> 1)
Basically it first calculates the number of occurrences, [3, 2, 1]. From there it's easy to construct a map. And the way it calculates the occurrences is:
go through the list of ranges
transform each range into number of occurrences for that range, which is done like this :
how many numbers from the list nums are contained in that range?

Here is an efficient solution that builds the map in a single pass over ranges:
ranges
.map(r => r -> nums.count(n => n >= r._1 && n <= r._2))
.toMap
This avoids the overhead of creating a list of numbers and then zipping them with the ranges in a separate step.
This is a version that uses more Scala features but is a bit too fancy:
(for {
r <- ranges
range = r._1 to r._2
} yield r -> nums.count(range.contains)
).toMap
This is also less efficient because contains has to allow for ranges with a step value and is therefore more complicated.
And here is an even more efficient version that avoids any temporary data structures:
val result: Map[(Int, Int), Int] =
ranges.map(r => r -> nums.count(n => n >= r._1 && n <= r._2))(collection.breakOut)
See this explanation of breakOut if you are not familiar with it. Using breakOut means that the map call will build the Map directly rather than creating a List that has to be converted to a Map using toMap.
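Note that collection.breakOut only exists up to Scala 2.12; it was removed when the collections were redesigned for 2.13. A minimal sketch of a variant that should compile on both older and newer versions, assuming the same ranges and nums as in the question, maps over an iterator so that no intermediate List of pairs is built:

```scala
val ranges = List((1, 4), (5, 8), (9, 10))
val nums = List(2, 2, 3, 7, 8, 9)

// Mapping over an iterator is lazy, so toMap consumes the pairs one by one
// instead of first materialising a List[((Int, Int), Int)].
val result: Map[(Int, Int), Int] =
  ranges.iterator
    .map { case r @ (lo, hi) => (r, nums.count(n => n >= lo && n <= hi)) }
    .toMap
```

For a list of three ranges the difference is negligible; it only starts to matter when ranges is large.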

Related

Sum of Values based on key in scala

I am new to Scala. I have a list of integer tuples:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How can I do this for N values? I tried the above, but it is not a good way to sum N values based on a key.
I have also tried the following, but both attempts give errors:
val sum = list.groupBy(_._1).values.sum // error
val sum = list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) // error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyways. Also, as a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList) // convert tuples to list
.groupBy(_.head) // group by first element of list
.mapValues(_.map(_.tail).map(_.sum).sum) // sums elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since a tuple can't be converted to a List in a type-safe way, you need to add the fields one by one, as in j._1 + j._2 + j._3.
Using the first element of the tuple as the key and the remaining elements as the values to sum, you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to use asInstanceOf[Int], but when you call .productIterator you get an Iterator of Any.
This will work for any tuple size.
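If a separate total per column is wanted for each key, rather than one combined total, the grouping step stays the same and only the per-group reduction changes. A minimal sketch for the three-element tuples from the question (the name columnSums is only illustrative):

```scala
val list = List((1, 2, 3), (2, 3, 4), (1, 2, 3))

// Group by the first field, then sum the second and third fields separately.
val columnSums: Map[Int, (Int, Int)] =
  list.groupBy(_._1).map { case (key, rows) =>
    (key, (rows.map(_._2).sum, rows.map(_._3).sum))
  }
// Map(1 -> (4,6), 2 -> (3,4)), possibly in a different entry order
```

This stays type safe but is tied to the tuple arity, so it does not generalise to N fields the way the shapeless and productIterator versions do.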

How to sum adjacent elements in scala

I want to sum adjacent elements in scala and I'm not sure how to deal with the last element.
So I have a list:
val x = List(1,2,3,4)
And I want to sum adjacent elements using indices and map:
val size = x.indices.size
val y = x.indices.map(i =>
if (i < size - 1)
x(i) + x(i+1))
The problem is that this approach creates an AnyVal element at the end:
res1: scala.collection.immutable.IndexedSeq[AnyVal] = Vector(3, 5, 7, ())
and if I try to sum the elements or another numeric method of the collection, it doesn't work:
error: could not find implicit value for parameter num: Numeric[AnyVal]
I tried to filter out the element using:
y diff List(Unit) or y diff List(AnyVal)
but it doesn't work.
Is there a better approach in Scala to do this type of adjacent sum without using a for loop?
For a more functional solution, you can use sliding to group the elements together in twos (or any number of them), then map to their sum.
scala> List(1, 2, 3, 4).sliding(2).map(_.sum).toList
res80: List[Int] = List(3, 5, 7)
What sliding(2) will do is create an intermediate iterator of lists like this:
Iterator(
List(1, 2),
List(2, 3),
List(3, 4)
)
So when we chain map(_.sum), we will map each inner List to its own sum. toList will convert the Iterator back into a List.
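One edge case is worth knowing: when the list is shorter than the window, sliding still yields a single, shorter group, so a one-element list would produce its only element rather than an empty result. A small sketch that guards against this by keeping only full pairs (adjacentSums is a hypothetical helper name):

```scala
// Sum each pair of neighbours; groups that are not exactly two elements long
// (which only happens for a one-element input) are skipped by collect.
def adjacentSums(xs: List[Int]): List[Int] =
  xs.sliding(2).collect { case List(a, b) => a + b }.toList

val sums = adjacentSums(List(1, 2, 3, 4)) // List(3, 5, 7)
val single = adjacentSums(List(1))        // List()
val unguarded = List(1).sliding(2).map(_.sum).toList // List(1)
```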
You can try pattern matching and tail recursion also.
import scala.annotation.tailrec
@tailrec
def f(l:List[Int],r :List[Int]=Nil):List[Int] = {
l match {
case x :: xs :: xss =>
f(l.tail, r :+ (x + xs))
case _ => r
}
}
scala> f(List(1,2,3,4))
res4: List[Int] = List(3, 5, 7)
With a for comprehension by zipping two lists, the second with the first item dropped,
for ( (a,b) <- x zip x.drop(1) ) yield a+b
which results in
List(3, 5, 7)

Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]

Is it possible in Spark to implement '.combinations' function from scala collections?
/** Iterates over combinations.
*
* @return An Iterator which traverses the possible n-element combinations of this $coll.
* @example `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
*/
For example, how can I get from RDD[X] to RDD[List[X]] or RDD[(X,X)] for combinations of size = 2? And let's assume that all values in the RDD are unique.
Cartesian product and combinations are two different things: the Cartesian product will create an RDD of size rdd.count() ^ 2, while combinations will create an RDD of size rdd.count() choose 2.
val rdd = sc.parallelize(1 to 5)
val combinations = rdd.cartesian(rdd).filter{ case (a,b) => a < b }
combinations.collect()
Note that this will only work if an ordering is defined on the elements, since we use <. This version only works for choosing two elements, but it can be extended to larger combinations by requiring the elements of each sequence to be strictly increasing.
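The filtering idea can be checked without a cluster. The following is only a local analogue on plain Scala collections, not Spark code: it builds the Cartesian product of a small list with itself, keeps the pairs with a < b, and compares the result with the standard library's combinations(2).

```scala
val xs = (1 to 5).toList

// Cartesian product restricted to a < b: each unordered pair appears once,
// and pairs of an element with itself are dropped.
val viaFilter: List[(Int, Int)] =
  for { a <- xs; b <- xs if a < b } yield (a, b)

// The same pairs from the collections library, converted to tuples.
val viaCombinations: List[(Int, Int)] =
  xs.combinations(2).collect { case List(a, b) => (a, b) }.toList
```

With 5 elements the product has 25 pairs, of which 10 survive the filter, which matches 5 choose 2.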
This is supported natively by a Spark RDD with the cartesian transformation.
e.g.:
val rdd = sc.parallelize(1 to 5)
val cartesian = rdd.cartesian(rdd)
cartesian.collect
Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (1,5),
(2,1), (2,2), (2,3), (2,4), (2,5),
(3,1), (3,2), (3,3), (3,4), (3,5),
(4,1), (4,2), (4,3), (4,4), (4,5),
(5,1), (5,2), (5,3), (5,4), (5,5))
As discussed, cartesian will give you n^2 elements of the cartesian product of the RDD with itself.
This algorithm computes the combinations (n,2) of an RDD without having to compute the n^2 elements first: (used String as type, generalizing to a type T takes some plumbing with classtags that would obscure the purpose here)
This is probably less time-efficient than cartesian + filtering, because the iterative count and take actions force the computation of the RDD, but it is more space-efficient, as it calculates only the C(n,2) = n!/(2!(n-2)!) = n(n-1)/2 elements instead of the n^2 of the Cartesian product.
import org.apache.spark.rdd._
def combs(rdd:RDD[String]):RDD[(String,String)] = {
val count = rdd.count // count is an action, so compute it once and reuse it
if (count < 2) {
sc.makeRDD[(String,String)](Seq.empty)
} else if (count == 2) {
val values = rdd.collect
sc.makeRDD[(String,String)](Seq((values(0), values(1))))
} else {
val elem = rdd.take(1)
val elemRdd = sc.makeRDD(elem)
val subtracted = rdd.subtract(elemRdd)
val comb = subtracted.map(e => (elem(0),e))
comb.union(combs(subtracted))
}
}
This creates all combinations (n, 2) and works for any RDD without requiring any ordering on the elements of RDD.
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}
a._2 and b._2 are the indices, while a._1 and b._1 are the elements of the original RDD.
Example:
Note that no ordering is defined on the maps here.
val m1 = Map('a' -> 1, 'b' -> 2)
val m2 = Map('c' -> 3, 'a' -> 4)
val m3 = Map('e' -> 5, 'c' -> 6, 'b' -> 7)
val rdd = sc.makeRDD(Array(m1, m2, m3))
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}.collect
Output:
Array((Map(a -> 1, b -> 2),Map(c -> 3, a -> 4)), (Map(a -> 1, b -> 2),Map(e -> 5, c -> 6, b -> 7)), (Map(c -> 3, a -> 4),Map(e -> 5, c -> 6, b -> 7)))

How to convert an Iterator[int] to a map with frequency per bin in scala

I just learned how to convert a list of integers to a map with frequency per bin in scala.
How to convert list of integers to a map with frequency per bin in scala
However, I am working with a 22 GB file, so I am streaming through the file:
Source.fromFile("test.txt").getLines.filter(x => x.charAt(0) != '#').map(x => x.split("\t")(1)).map(x => x.toInt)
The groupBy function only works on a list, not on an iterator, I guess because it needs all the values in memory. I can't convert the iterator to a list because of the file size.
So an example would be
List(1,2,3,101,330,302).iterator
And how can I go from there to
res1: scala.collection.immutable.Map[Int,Int] = Map(100 -> 1, 300 -> 2, 0 -> 3)
You may use fold:
val iter = List(1,2,3,101,330,302).iterator
iter.foldLeft(Map[Int, Int]()) {(accum, a) =>
val key = a/100 * 100;
accum + (key -> (accum.getOrElse(key, 0) + 1))}
// scala.collection.immutable.Map[Int,Int] = Map(0 -> 3, 100 -> 1, 300 -> 2)
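For a very long stream, the same counting can also be done with a mutable map that is updated in place. This is only a sketch under the assumption that the bin width is a parameter; binCounts and width are illustrative names, not part of the question.

```scala
import scala.collection.mutable

// Consume the iterator once, incrementing the counter of each value's bin.
// The bin key is the value rounded down to a multiple of width
// (for non-negative inputs; integer division truncates toward zero).
def binCounts(values: Iterator[Int], width: Int): Map[Int, Int] = {
  val counts = mutable.Map.empty[Int, Int].withDefaultValue(0)
  values.foreach { v =>
    val key = (v / width) * width
    counts(key) += 1
  }
  counts.toMap
}

val bins = binCounts(List(1, 2, 3, 101, 330, 302).iterator, 100)
// Map(0 -> 3, 100 -> 1, 300 -> 2), possibly in a different entry order
```

Only one entry per bin is kept in memory, so the size of the input file does not matter as long as the number of distinct bins is manageable.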

Count occurrences of each item in a Scala parallel collection

My question is very similar to Count occurrences of each element in a List[List[T]] in Scala, except that I would like to have an efficient solution involving parallel collections.
Specifically, I have a large (~10^7) vector vec of short (~10) lists of Ints, and I would like to get for each Int x the number of times x occurs, for example as a Map[Int,Int]. The number of distinct integers is of the order 10^6.
Since the machine this needs to be done on has a fair amount of memory (150GB) and number of cores (>100) it seems like parallel collections would be a good choice for this. Is the code below a good approach?
val flatpvec = vec.par.flatten
val flatvec = flatpvec.seq
val unique = flatpvec.distinct
val counts = unique map (x => (x -> flatvec.count(_ == x)))
counts.toMap
Or are there better solutions? In case you are wondering about the .seq conversion: for some reason the following code doesn't seem to terminate, even for small examples:
val flatpvec = vec.par.flatten
val unique = flatpvec.distinct
val counts = unique map (x => (x -> flatpvec.count(_ == x)))
counts.toMap
Here is an approach using aggregate, which is like fold except that you also supply a function to combine the results of the sequential folds.
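To make the two roles concrete, here is a small sketch that imitates what aggregate does, without parallel collections: the data is split into chunks by hand, each chunk is folded with a sequential operation, and the partial results are merged with a combining operation. The names seqop and combop and the chunk size are assumptions for illustration; with a parallel collection the splitting would be done by the library.

```scala
// Sequential operation: add the Ints of one list to a running count map.
def seqop(acc: Map[Int, Int], xs: List[Int]): Map[Int, Int] =
  xs.foldLeft(acc)((m, x) => m.updated(x, m.getOrElse(x, 0) + 1))

// Combining operation: merge two partial count maps by adding their counts.
def combop(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
  b.foldLeft(a) { case (m, (k, c)) => m.updated(k, m.getOrElse(k, 0) + c) }

val vec = Vector(List(1, 2, 2), List(2, 3), List(1, 3, 3), List(3))

// Simulate two partitions: fold each chunk separately, then merge.
val partials = vec.grouped(2).map(_.foldLeft(Map.empty[Int, Int])(seqop)).toList
val counts = partials.reduce(combop)
// Map(1 -> 2, 2 -> 3, 3 -> 4), possibly in a different entry order
```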
Update: It's not surprising that there is overhead in .par.groupBy, but I was surprised by the constant factor. By these numbers, you would never count that way. Also, I had to bump the memory way up.
The interesting technique used to build the result map is described in this paper linked from the overview. (It cleverly saves the intermediate results and then coalesces them in parallel at the end.)
But copying around the intermediate results of the groupBy turns out to be expensive, if all you really want is a count.
The numbers are comparing sequential groupBy, parallel, and finally aggregate.
apm@mara:~/tmp$ scalacm countints.scala ; scalam -J-Xms8g -J-Xmx8g -J-Xss1m countints.Test
GroupBy: Starting...
Finished in 12695
GroupBy: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Par GroupBy: Starting...
Finished in 51481
Par GroupBy: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Aggregate: Starting...
Finished in 2672
Aggregate: List((233,10078), (237,20041), (268,9939), (279,9958), (315,10141), (387,9917), (462,9937), (680,9932), (848,10139), (858,10000))
Nothing magical in the test code.
import collection.GenTraversableOnce
import collection.concurrent.TrieMap
import collection.mutable
import concurrent.duration._
trait Timed {
def now = System.nanoTime
def timed[A](op: =>A): A = {
val start = now
val res = op
val end = now
val lapsed = (end - start).nanos.toMillis
Console println s"Finished in $lapsed"
res
}
def showtime(title: String, op: =>GenTraversableOnce[(Int,Int)]): Unit = {
Console println s"$title: Starting..."
val res = timed(op)
//val showable = res.toIterator.min //(res.toIterator take 10).toList
val showable = res.toList.sorted take 10
Console println s"$title: $showable"
}
}
It generates some random data for interest.
object Test extends App with Timed {
val upto = math.pow(10,6).toInt
val ran = new java.util.Random
val ten = (1 to 10).toList
val maxSamples = 1000
// samples of ten random numbers in the desired range
val samples = (1 to maxSamples).toList map (_ => ten map (_ => ran nextInt upto))
// pick a sample at random
def anyten = samples(ran nextInt maxSamples)
def mag = 7
val data: Vector[List[Int]] = Vector.fill(math.pow(10,mag).toInt)(anyten)
The sequential operation and the combining operation of aggregate are invoked from a task, and the result is assigned to a volatile var.
def z: mutable.Map[Int,Int] = mutable.Map.empty[Int,Int]
def so(m: mutable.Map[Int,Int], is: List[Int]) = {
for (i <- is) {
val v = m.getOrElse(i, 0)
m(i) = v + 1
}
m
}
def co(m: mutable.Map[Int,Int], n: mutable.Map[Int,Int]) = {
for ((i, count) <- n) {
val v = m.getOrElse(i, 0)
m(i) = v + count
}
m
}
showtime("GroupBy", data.flatten groupBy identity map { case (k, vs) => (k, vs.size) })
showtime("Par GroupBy", data.flatten.par groupBy identity map { case (k, vs) => (k, vs.size) })
showtime("Aggregate", data.par.aggregate(z)(so, co))
}
If you want to make use of parallel collections and Scala standard tools, you could do it like this. Group your collection by identity and then map it to (Value, Count):
scala> val longList = List(1, 5, 2, 3, 7, 4, 2, 3, 7, 3, 2, 1, 7)
longList: List[Int] = List(1, 5, 2, 3, 7, 4, 2, 3, 7, 3, 2, 1, 7)
scala> longList.par.groupBy(x => x)
res0: scala.collection.parallel.immutable.ParMap[Int,scala.collection.parallel.immutable.ParSeq[Int]] = ParMap(5 -> ParVector(5), 1 -> ParVector(1, 1), 2 -> ParVector(2, 2, 2), 7 -> ParVector(7, 7, 7), 3 -> ParVector(3, 3, 3), 4 -> ParVector(4))
scala> longList.par.groupBy(x => x).map(x => (x._1, x._2.size))
res1: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(5 -> 1, 1 -> 2, 2 -> 3, 7 -> 3, 3 -> 3, 4 -> 1)
Or even nicer like pagoda_5b suggested in the comments:
scala> longList.par.groupBy(identity).mapValues(_.size)
res1: scala.collection.parallel.ParMap[Int,Int] = ParMap(5 -> 1, 1 -> 2, 2 -> 3, 7 -> 3, 3 -> 3, 4 -> 1)