How to sum on a groupBy with an iterator? - scala

Given an Iterator[(String, Int)], I would like to group by the String, sum the Int values, and return the result as a Map[String, Int].

You can convert it to a list or other strict structure (on Scala 2.13+, mapValues returns a lazy view, so append .toMap if you need a strict Map):
iter.toList.groupBy(_._1).mapValues(_.map(_._2).sum)
If you don't want to convert to a strict structure (which forces all of the entries into memory), you can foldLeft and build the map as you go:
iter.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc + (k -> acc.get(k).map(_ + v).getOrElse(v))
}
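As a quick check of the fold-based approach, here is a minimal sketch run on a small hand-made iterator. The object name SumByKeyDemo, the helper sumByKey, and the sample pairs are illustrative only and not part of the original answer; the sketch reads the running total with getOrElse(k, 0).

```scala
object SumByKeyDemo {
  // Fold over the iterator once, keeping only the running totals in memory.
  def sumByKey(iter: Iterator[(String, Int)]): Map[String, Int] =
    iter.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc + (k -> (acc.getOrElse(k, 0) + v))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: key "a" appears twice.
    val pairs = Iterator("a" -> 1, "b" -> 2, "a" -> 3)
    println(sumByKey(pairs)) // totals: a -> 4, b -> 2
  }
}
```

Because the iterator is consumed element by element, only one entry per distinct key is held at any time.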

Related

Getting the mode from an RDD

I would like to get the mode (the most common number) from an RDD using Spark + Scala.
I can get it by doing the following, but I think there could be a better way to calculate this. The most important thing is that if more than one value has the same number of repetitions, I need to return all of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map (e => (e, 1)).groupBy(_._1).map(e=> (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
But is there a better way to do this? I am asking with performance in mind, because the original RDD would be much bigger than this.
This should work and be a little more efficient
(only if you are sure the number of distinct values is small, since countByValue returns the counts to the driver as a local Map):
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = counted.collect { case (k, v) if (v == max) => k }
If there could be many elements, consider this alternative which is memory safe.
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
How about getting the max key-value pair from a double groupBy? This works even better for larger data sizes.
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the elements:
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max._2.map(_._1)
// res4: Iterable[Int] = List(3, 7)
The first groupBy, followed by mapValues, turns the elements into (element -> count) pairs of type RDD[(Int, Int)]. The second groupBy groups those pairs by count, giving (count -> Iterable((element, count))). Then max simply picks the key-value pair with the largest key, which is the count.
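The count-then-filter logic in these answers can be tried locally without a Spark cluster. The following is a minimal sketch on plain Scala collections, standing in for the RDD; the names ModeDemo and modes are made up for illustration, and groupBy(identity) plays the role of countByValue here.

```scala
object ModeDemo {
  // Return every value that occurs the maximum number of times (all ties).
  def modes(xs: Seq[Int]): Set[Int] =
    if (xs.isEmpty) Set.empty
    else {
      // Count occurrences per distinct value (local analogue of countByValue).
      val counted: Map[Int, Int] =
        xs.groupBy(identity).map { case (k, vs) => k -> vs.size }
      val maxCount = counted.values.max
      // Keep only the values whose count equals the maximum.
      counted.collect { case (k, c) if c == maxCount => k }.toSet
    }
}
```

With the list from the question, ModeDemo.modes(List(3,4,4,3,3,7,7,7,9)) yields the set containing 3 and 7.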

How to find the common values in key value pairs and put it as value in all pairs?

How can I get the intersection of values in key value pairs?
I have pairs:
(p, Set(n))
in which I used reduceByKey and finally got:
(p1, Set(n1, n2)) (p2, Set(n1, n2, n3)) (p3, Set(n2, n3))
What I want is to find the n values that exist in all of the pairs and use them as the value of every pair. For the above data, the result would be
(p1, Set(n2)) (p2, Set(n2)), (p3, Set(n2))
As far as I could find, there is no reduceByValue in Spark. The only function that seemed close to what I want was reduce(), but it didn't work, as the result was only one key-value pair ((p3, Set(n2))).
Is there any way to solve this? Or should I rethink the approach from the start?
Code:
val rRdd = inputFile.map(x => (x._1, Set(x._2))).reduceByKey(_ ++ _)
val wrongRdd = rRdd.reduce{(x, y) => (x._1, x._2.intersect(y._2))}
I can see why wrongRdd is not correct; I only include it to show where (p3, Set(n2)) came from.
You can first reduce the sets to their intersection (say, s), then replace (k, v) with (k, s):
val rdd = sc.parallelize(Seq(
("p1", Set("n1", "n2")),
("p2", Set("n1", "n2", "n3")),
("p3", Set("n2", "n3"))
))
val s = rdd.map(_._2).reduce(_ intersect _)
// s: scala.collection.immutable.Set[String] = Set(n2)
rdd.map{ case (k, v) => (k, s) }.collect
// res1: Array[(String, scala.collection.immutable.Set[String])] = Array(
// (p1,Set(n2)), (p2,Set(n2)), (p3,Set(n2))
// )
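The same two-step idea (reduce all sets to their intersection, then attach it to every key) can be sketched on plain Scala collections for local testing. The names CommonValuesDemo and withCommonValues are hypothetical, and the use of reduceOption to tolerate empty input is an addition that the answer above does not include.

```scala
object CommonValuesDemo {
  // Replace every value set with the intersection of all value sets.
  def withCommonValues[K, V](pairs: Seq[(K, Set[V])]): Seq[(K, Set[V])] = {
    // reduceOption avoids an exception when the input is empty.
    val common: Set[V] =
      pairs.map(_._2).reduceOption(_ intersect _).getOrElse(Set.empty[V])
    pairs.map { case (k, _) => (k, common) }
  }
}
```

In the Spark version, the intersection s is a small local value captured by the closure passed to map, so computing it first with reduce and reusing it afterwards is the key design choice.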

Scala Spark - Reduce RDD by adding multiple values per key

I have a Spark RDD that is in the format of (String, (Int, Int)) and I would like to add the Int values together to create a (String, Int) map.
This is an example of an element in my RDD:
res13: (String, (Int, Int)) = (9D4669B432A0FD,(1,1))
I would like to end with an RDD of (String, Int) = (9D4669B432A0FD,2)
You should just map the values to the sum of the second pair:
yourRdd.map(pair => (pair._1, pair._2._1 + pair._2._2))
@marios suggested the following nicer syntax in an edit:
Or if you want to make it a bit more readable:
yourRdd.map{case(str, (x1,x2)) => (str, x1+x2)}
Gabor Bakos's answer is correct if the keys are unique. But if you have multiple identical keys and want to reduce them to unique keys, use reduceByKey.
Example:
val data = Array(("9888wq",(1,2)),("abcd",(1,1)),("abcd",(3,2)),("9888wq",(4,2)))
val rdd= sc.parallelize(data)
val result = rdd.map(x => (x._1,(x._2._1+x._2._2))).reduceByKey((x,y) => x+y)
result.foreach(println)
Output :
(9888wq,9)
(abcd,7)
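To sanity-check the expected totals without Spark, the map-then-reduceByKey pipeline can be mirrored on a plain Seq. This is only a local sketch: SumPairsByKeyDemo and sumByKey are made-up names, and groupBy plus sum stands in for reduceByKey.

```scala
object SumPairsByKeyDemo {
  def sumByKey(data: Seq[(String, (Int, Int))]): Map[String, Int] =
    data
      .map { case (key, (a, b)) => (key, a + b) } // collapse each (Int, Int) pair
      .groupBy(_._1)                              // gather rows that share a key
      .map { case (key, rows) => key -> rows.map(_._2).sum }
}
```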

Summing items within a Tuple

Below is a list of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences associated with each id by summing the Int values, so the data structure above should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large data sets, as it does not replicate the ids in the accumulated values.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
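The group-then-sum step described above might look like the following on a plain List. This is a hedged sketch rather than the answerer's code: the names GroupAndSumDemo, Row, key, and sumPerKey are invented for illustration, and it groups by (id, name) instead of id alone so that the result lines up with the output the question asks for.

```scala
object GroupAndSumDemo {
  type Row = (String, String, Int)

  // Discriminator function passed to groupBy: rows sharing (id, name) land together.
  def key(row: Row): (String, String) = (row._1, row._2)

  // For each group, add up the Int in the third position.
  def sumPerKey(rows: List[Row]): Map[(String, String), Int] =
    rows.groupBy(key).map { case (k, group) => k -> group.map(_._3).sum }
}
```

Calling toList on the result and mapping each ((id, name), total) entry to (id, name, total) gives the flat triples from the question.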
The following is the most readable, efficient and scalable:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]; add .map { case ((key1, key2), sum) => (key1, key2, sum) } if you need an RDD[(String, String, Int)]. Using reduceByKey means the summation is parallelized: even for very large groups, partial sums are computed on the map side before the shuffle. Think about the case where there are only 10 groups but billions of records: grouping first and then calling .sum won't scale, as the summation can only be distributed across 10 cores.
A few more notes about the other answers:
Using head here is unnecessary, because the group key already holds both ids: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)).values can be written as .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really horrible when, as shown above, .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Scala extract from Seq of tuples

Here's a seq of tuples in Scala
val t = Seq((1,2,3),(4,5,6))
I would like to extract the first element of each tuple into its own sequence, i.e.,
Seq(1,4)
How do I do this in Scala?
Simply use map and transform each tuple to its first element:
t.map(x => x._1)
Or shorter:
t.map(_._1)
The general form, to extract more than one column:
def extractColumns3[T1, T2, T3](t: Seq[(T1, T2, T3)]): (Seq[T1], Seq[T2], Seq[T3]) =
t.foldLeft((Seq.empty[T1], Seq.empty[T2], Seq.empty[T3])) { (columns, row) ⇒
(columns._1 :+ row._1, columns._2 :+ row._2, columns._3 :+ row._3)
}
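For a sequence of triples, the standard library's unzip3 does the same column split in one call. A minimal sketch using the t from the question; the object name ColumnsDemo and the names firsts, seconds, and thirds are chosen here for illustration.

```scala
object ColumnsDemo {
  val t = Seq((1, 2, 3), (4, 5, 6))

  // unzip3 splits a collection of triples into three collections, one per position.
  val (firsts, seconds, thirds) = t.unzip3
}
```

For pairs the analogous method is unzip; tuples wider than three need a fold like the one above or one map per column.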