Query related to combineByKey - scala

For the following input => [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
after processing with combineByKey I am expecting the below output
Expected output => [('A', [(3, 9), (4, 16), (5, 25)]), ('B', [(1, 1), (2, 4)])]
scala> val x = sc.parallelize(Array(('B',1),('B',2),('A',3),('A',4),('A',5)))
x: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
scala> def createCombiner (element:Int) :String = (element.toString + "," + Math.pow(element,2).toInt)
createCombiner: (element: Int)String
scala> def mergeValue (accumlator:String, element:Int) : String = (accumlator + (element.toString + Math.pow(element,2).toInt))
mergeValue: (accumlator: String, element: Int)String
scala> def mergeComb (accumlator:String ,accumlator1:String):String = (accumlator + accumlator1)
mergeComb: (accumlator: String, accumlator1: String)String
scala> val combRDD = x.map(t => (t._1, (t._2))).combineByKey(createCombiner, mergeValue, mergeComb)
combRDD: org.apache.spark.rdd.RDD[(Char, String)] = ShuffledRDD[48] at combineByKey at <console>:31
scala> combRDD.collect
res39: Array[(Char, String)] = Array((A,3,94,165,25), (B,1,12,4))
I am not able to get the expected output. As I am very new to Spark, I would appreciate some guidance on this.

What about:
scala> val x = sc.parallelize(Array(('B',1),('B',2),('A',3),('A',4),('A',5)))
scala> def createCombiner(element:Int) : List[(Int, Int)] = List(element -> element * element)
scala> def mergeValue (accumulator: List[(Int, Int)], element:Int) : List[(Int, Int)] = accumulator ++ createCombiner(element)
scala> def mergeComb (accumulator: List[(Int, Int)], accumulator1: List[(Int, Int)]): List[(Int, Int)] = (accumulator ++ accumulator1)
scala> val combRDD = x.combineByKey(createCombiner, mergeValue, mergeComb)
scala> combRDD.collect
// res0: Array[(Char, List[(Int, Int)])] = Array((A,List((3,9), (4,16), (5,25))), (B,List((1,1), (2,4))))
// Or
scala> combRDD.mapValues(_.mkString("[", ", ", "]")).collect
res1: Array[(Char, String)] = Array((A,[(3,9), (4,16), (5,25)]), (B,[(1,1), (2,4)]))
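To see how the three functions cooperate, here is a Spark-free sketch that imitates what combineByKey does, under stated assumptions: the two-partition split and the helpers combinePartition and combineAll are hypothetical and only exist for illustration. Within a partition, the first value of a key goes through createCombiner and later values through mergeValue; the per-partition results are then joined with the merge-combiners function. The output in the question (3,94,165,25) looks like what you would get if every value landed in its own partition, so that only createCombiner and mergeComb ran and the strings were concatenated without a separator; that is a guess from the output, not something the question states.

```scala
object CombineByKeySketch {
  type Acc = List[(Int, Int)]

  // The same three functions as in the answer above.
  def createCombiner(element: Int): Acc = List(element -> element * element)
  def mergeValue(acc: Acc, element: Int): Acc = acc ++ createCombiner(element)
  def mergeCombiners(a: Acc, b: Acc): Acc = a ++ b

  // Hypothetical helper: combines the values of one partition locally.
  def combinePartition(part: Seq[(Char, Int)]): Map[Char, Acc] =
    part.foldLeft(Map.empty[Char, Acc]) { case (m, (k, v)) =>
      m.get(k) match {
        case None      => m + (k -> createCombiner(v))  // first value seen for this key
        case Some(acc) => m + (k -> mergeValue(acc, v)) // later values for this key
      }
    }

  // Hypothetical helper: merges the per-partition results key by key.
  def combineAll(partitions: Seq[Seq[(Char, Int)]]): Map[Char, Acc] =
    partitions.map(combinePartition).foldLeft(Map.empty[Char, Acc]) { (merged, m) =>
      m.foldLeft(merged) { case (out, (k, acc)) =>
        out + (k -> out.get(k).map(mergeCombiners(_, acc)).getOrElse(acc))
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumed split of the question's input into two partitions.
    val partitions = Seq(
      Seq('B' -> 1, 'B' -> 2, 'A' -> 3),
      Seq('A' -> 4, 'A' -> 5)
    )
    combineAll(partitions).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because the accumulator is a List[(Int, Int)] rather than a String, the pairs stay intact no matter how the values are distributed across partitions.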

How to find the sum of values at each index position, per key, in Spark/Scala

scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I then tried the following:
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum for only one index.
I want the sum to be applied per index, i.e. the sum of all a[0], the sum of all a[1], and so on.
My expected output is:
(a,(8,10,12))
(b,(14,16,18))
Please help.
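Use transpose and map the result to _.sum:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object RddExample {
  def main(args: Array[String]): Unit = {
    // The original used a project-specific helper (Constant.getSparkSess);
    // a local session is assumed here so that the example is self-contained.
    val spark = SparkSession.builder().master("local[*]").appName("RddExample").getOrCreate()
    val sc = spark.sparkContext
    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)
    // Split each record into (key, Array[Int]).
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))
    // Group the rows per key, transpose them into columns, and sum each column.
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
      itr.toList.transpose.map(_.sum)
    }).collect()
    // Print each element; println(result) would only print the array reference.
    result.foreach(println)
    spark.stop()
  }
}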
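To make the transpose step concrete, here is a small sketch on plain Scala lists, with no Spark involved. The name sumByIndex is a hypothetical helper, not part of the answer. It shows that transpose turns the rows collected for one key into columns, so that summing each column gives the per-index totals.

```scala
object TransposeSumSketch {
  // Sums a group of equally long rows column by column.
  def sumByIndex(rows: List[List[Int]]): List[Int] =
    rows.transpose.map(_.sum)

  def main(args: Array[String]): Unit = {
    // The two rows that belong to key "a" in the question.
    val rowsForA = List(List(1, 2, 3), List(7, 8, 9))
    println(rowsForA.transpose)   // List(List(1, 7), List(2, 8), List(3, 9))
    println(sumByIndex(rowsForA)) // List(8, 10, 12)
  }
}
```

Note that transpose expects all rows to have the same length and throws an exception otherwise, so records with a varying number of fields would need to be handled before this step.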
I think QuickSilver's answer is perfect.
I also tried an approach using reduceByKey, but it adds each index manually, so it will not help if there are more indexes.
Up to mapRDD, the code is the same as in my question.
.....
.....
mapRDD = code as per Question
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
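One way to keep the reduceByKey idea without hard-coding three positions is to add the arrays element-wise with zip. The sketch below is an illustration on plain Scala collections, where groupBy plus reduce stands in for reduceByKey; addByIndex and sumPerKey are hypothetical names. On an RDD[(String, Array[Int])], the same addByIndex function could presumably be passed to reduceByKey directly.

```scala
object ZipReduceSketch {
  // Element-wise sum of two rows; works for any number of columns.
  def addByIndex(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // Plain-collections stand-in for reduceByKey(addByIndex).
  def sumPerKey(rows: Seq[(String, Array[Int])]): Map[String, List[Int]] =
    rows.groupBy(_._1).map { case (key, group) =>
      key -> group.map(_._2).reduce((a, b) => addByIndex(a, b)).toList
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12").map { rec =>
      val parts = rec.split(",")
      parts(0) -> parts(1).split("\\|").map(_.toInt)
    }
    sumPerKey(rows).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Unlike transpose, zip silently truncates to the shorter array, so rows of unequal length would lose trailing values rather than fail.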

How to access a value of a Scala tuple

I have a sequence of tuples, each holding a value and its square:
val fields3: Seq[(Int, Int)] = Seq((3, 9), (5, 25))
What I want to know is whether there is a way to refer to a value of the same tuple directly when I create the object, without using a foreach:
val fields3: Seq[(Int, Int)] = Seq((3, 3 * 3 ), (5, 5 * 5))
My idea is something like:
val fields3: Seq[(Int, Int)] = Seq((3, _1 * _1 ), (5, _1 * _1)) // this doesn't compile
You can do something like this:
Seq(2,3,4).map(i => (i, i*i))
You could wrap the tuple in a case class potentially:
case class TupleInt(base: Int) {
val tuple: (Int, Int) = (base, base*base)
}
Then you could create the sequence like this:
val fields3: Seq[(Int, Int)] = Seq(TupleInt(3), TupleInt(5)).map(_.tuple)
I would prefer the answer @geek94 gave; this is too verbose for what you want to do.
An equally valid way to express this is:
val fields3: Seq[(Int, Int)] = Seq(3, 5).map(i => i -> i*i)
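If the same construction is needed in several places, the map can be wrapped in a small function. This is only a sketch; withSquares is a hypothetical name that does not appear in the answers above.

```scala
object SquaresSketch {
  // Hypothetical helper: pairs each value with its square.
  def withSquares(values: Int*): Seq[(Int, Int)] =
    values.map(v => v -> v * v)

  def main(args: Array[String]): Unit = {
    val fields3: Seq[(Int, Int)] = withSquares(3, 5)
    fields3.foreach(println)
  }
}
```

The tuple element is written only once per value, which is what the `_1 * _1` idea in the question was aiming for.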

Flatmap scala [String, String,List[String]]

I have this problem: I have an RDD[(String, String, List[String])], and I would like to "flatMap" it to obtain an RDD[(String, String, String)].
E.g.:
val x: RDD[(String, String, List[String])] =
RDD[(a, b, List("ra", "re", "ri"))]
I would like to get:
val result: RDD[(String, String, String)] =
RDD[(a, b, ra), (a, b, re), (a, b, ri)]
Use flatMap:
val rdd = sc.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
// rdd: org.apache.spark.rdd.RDD[(String, String, List[String])] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd.flatMap{ case (x, y, z) => z.map((x, y, _)) }.collect
// res23: Array[(String, String, String)] = Array((a,b,ra), (a,b,re), (a,b,ri))
This is an alternative way of doing it, again using flatMap:
val rdd = sparkContext.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
rdd.flatMap(array => array._3.map(list => (array._1, array._2, list))).foreach(println)
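The same expansion can be tried on a plain Scala collection, where a for-comprehension expresses the flatMap/map pair. This is a sketch on a local Seq rather than an RDD, and flatten is a hypothetical helper name.

```scala
object FlattenTripleSketch {
  // Expands each (x, y, list) triple into one (x, y, item) triple per list item.
  def flatten(rows: Seq[(String, String, List[String])]): Seq[(String, String, String)] =
    for {
      (x, y, zs) <- rows
      z <- zs
    } yield (x, y, z)

  def main(args: Array[String]): Unit =
    flatten(Seq(("a", "b", List("ra", "re", "ri")))).foreach(println)
}
```

A row whose list is empty produces no output rows at all, which is also how the flatMap versions above behave.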

Using scalaz state in a more complicated computation

I'm trying to understand how to use scalaz State to perform a complicated stateful computation. Here is the problem:
Given a List[Int] of potential divisors and a List[Int] of numbers, find a List[(Int, Int)] of matching pairs (divisor, number) where a divisor is allowed to match at most one number.
As a test:
def findMatches(divs: List[Int], nums: List[Int]): List[(Int, Int)]
And with the following input:
findMatches( List(2, 3, 4), List(1, 6, 7, 8, 9) )
We can get at most 3 matches. If we stipulate that the matches must be made in the order in which they occur when traversing the lists from left to right, then the matches must be:
List( (2, 6) , (3, 9) , (4, 8) )
So the following two tests need to pass:
assert(findMatches(List(2, 3, 4), List(1, 6, 7, 8, 9)) == List((2, 6), (3, 9), (4, 8)))
assert(findMatches(List(2, 3, 4), List(1, 6, 7, 8, 11)) == List((2, 6), (4, 8)))
Here's an imperative solution:
scala> def findMatches(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| var matches = List.empty[(Int, Int)]
| var remaining = nums
| divs foreach { div =>
| remaining find (_ % div == 0) foreach { n =>
| remaining = remaining filterNot (_ == n)
| matches = matches ::: List(div -> n)
| }
| }
| matches
| }
findMatches: (divs: List[Int], nums: List[Int])List[(Int, Int)]
Notice that I have to update the state of remaining as well as accumulate matches. It sounds like a job for scalaz traverse!
My attempt, which does not compile, has got me this far:
scala> def findMatches(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| divs.traverse[({type l[a] = State[List[Int], a]})#l, Int]( div =>
| state { (rem: List[Int]) => rem.find(_ % div == 0).map(n => rem.filterNot(_ == n) -> List(div -> n)).getOrElse(rem -> List.empty[(Int, Int)]) }
| ) ~> nums
| }
<console>:15: error: type mismatch;
found : List[(Int, Int)]
required: Int
state { (rem: List[Int]) => rem.find(_ % div == 0).map(n => rem.filterNot(_ == n) -> List(div -> n)).getOrElse(rem -> List.empty[(Int, Int)]) }
^
Your code only needs to be slightly modified in order to use State and Traverse:
// using scalaz-seven
import scalaz._
import Scalaz._
def findMatches(divs: List[Int], nums: List[Int]) = {
// the "state" we carry when traversing
case class S(matches: List[(Int, Int)], remaining: List[Int])
// initially there are no found pairs and a full list of nums
val initialState = S(List[(Int, Int)](), nums)
// a function to find a pair (div, num) given the current "state"
// we return a state transition that modifies the state
def find(div: Int) = modify((s: S) =>
s.remaining.find(_ % div == 0).map { (n: Int) =>
S(s.matches :+ div -> n, s.remaining.filterNot(_ == n))
}.getOrElse(s))
// the traversal, with no type annotation thanks to Scalaz7
// Note that we use `exec` to get the final state
// instead of `eval` that would just give us a List[Unit].
divs.traverseS(find).exec(initialState).matches
}
// List((2,6), (3,9), (4,8))
findMatches(List(2, 3, 4), List(1, 6, 7, 8, 9))
You can also use runTraverseS to write the traversal a bit differently:
divs.runTraverseS(initialState)(find)._2.matches
I have finally figured this out after much messing about:
scala> def findMatches(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| (divs.traverse[({type l[a] = State[List[Int], a]})#l, Option[(Int, Int)]]( div =>
| state { (rem: List[Int]) =>
| rem.find(_ % div == 0).map(n => rem.filterNot(_ == n) -> Some(div -> n)).getOrElse(rem -> none[(Int, Int)])
| }
| ) ! nums).flatten
| }
findMatches: (divs: List[Int], nums: List[Int])List[(Int, Int)]
I think I'll be looking at Eric's answer for more insight into what is actually going on, though.
Iteration #2
Exploring Eric's answer using scalaz6
scala> def findMatches2(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| case class S(matches: List[(Int, Int)], remaining: List[Int])
| val initialState = S(nil[(Int, Int)], nums)
| def find(div: Int, s: S) = {
| val newState = s.remaining.find(_ % div == 0).map { (n: Int) =>
| S(s.matches :+ div -> n, s.remaining filterNot (_ == n))
| }.getOrElse(s)
| newState -> newState.matches
| }
| val findDivs = (div: Int) => state((s: S) => find(div, s))
| (divs.traverse[({type l[a]=State[S, a]})#l, List[(Int, Int)]](findDivs) ! initialState).join
| }
findMatches2: (divs: List[Int], nums: List[Int])List[(Int, Int)]
scala> findMatches2(List(2, 3, 4), List(1, 6, 7, 8, 9))
res11: List[(Int, Int)] = List((2,6), (2,6), (3,9), (2,6), (3,9), (4,8))
The join on the List[List[(Int, Int)]] at the end is causing grief. Instead we can replace the last line with:
(divs.traverse[({type l[a]=State[S, a]})#l, List[(Int, Int)]](findDivs) ~> initialState).matches
Iteration #3
In fact, you can do away with the extra output of a state computation altogether and simplify even further:
scala> def findMatches2(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| case class S(matches: List[(Int, Int)], remaining: List[Int])
| def find(div: Int, s: S) =
| s.remaining.find(_ % div == 0).map( n => S(s.matches :+ div -> n, s.remaining filterNot (_ == n)) ).getOrElse(s) -> ()
| (divs.traverse[({type l[a]=State[S, a]})#l, Unit](div => state((s: S) => find(div, s))) ~> S(nil[(Int, Int)], nums)).matches
| }
findMatches2: (divs: List[Int], nums: List[Int])List[(Int, Int)]
Iteration #4
modify described above by Apocalisp is also available in scalaz6 and removes the need to explicitly supply the (S, ()) pair (although you do need Unit in the type lambda):
scala> def findMatches2(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
| case class S(matches: List[(Int, Int)], remaining: List[Int])
| def find(div: Int) = modify( (s: S) =>
| s.remaining.find(_ % div == 0).map( n => S(s.matches :+ div -> n, s.remaining filterNot (_ == n)) ).getOrElse(s))
| (divs.traverse[({type l[a]=State[S, a]})#l, Unit](div => state(s => find(div)(s))) ~> S(nil, nums)).matches
| }
findMatches2: (divs: List[Int], nums: List[Int])List[(Int, Int)]
scala> findMatches2(List(2, 3, 4), List(1, 6, 7, 8, 9))
res0: List[(Int, Int)] = List((2,6), (3,9), (4,8))
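For comparison, the state that these traversals thread through the divisors can also be carried by an ordinary foldLeft, with no scalaz at all. The sketch below reuses the S case class from the answers and is meant only to show what the State-based versions compute; it is not a replacement for them.

```scala
object FindMatchesFold {
  // The same state the State-based answers carry through the traversal.
  final case class S(matches: List[(Int, Int)], remaining: List[Int])

  def findMatches(divs: List[Int], nums: List[Int]): List[(Int, Int)] = {
    val finalState = divs.foldLeft(S(Nil, nums)) { (s, div) =>
      s.remaining.find(_ % div == 0) match {
        // A match: record the pair and remove the number from the pool.
        case Some(n) => S(s.matches :+ (div -> n), s.remaining.filterNot(_ == n))
        // No match: the state passes through unchanged.
        case None    => s
      }
    }
    finalState.matches
  }

  def main(args: Array[String]): Unit = {
    println(findMatches(List(2, 3, 4), List(1, 6, 7, 8, 9)))  // List((2,6), (3,9), (4,8))
    println(findMatches(List(2, 3, 4), List(1, 6, 7, 8, 11))) // List((2,6), (4,8))
  }
}
```

Each step of the fold plays the role of one modify call: it takes the current S and returns the next one, and the final S is what exec (or ~> in scalaz6) hands back.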

How would I yield an immutable.Map in Scala?

I have tried this but it does not work:
val map:Map[String,String] = for {
tuple2 <- someList
} yield tuple2._1 -> tuple2._2
How else would I convert a List of Tuple2s into a Map?
It couldn't be simpler:
Map(listOf2Tuples: _*)
using the apply method in Map companion object.
My first try was this:
scala> val country2capitalList = List("England" -> "London", "Germany" -> "Berlin")
country2capitalList: List[(java.lang.String, java.lang.String)] = List((England,London), (Germany,Berlin))
scala> val country2capitalMap = country2capitalList.groupBy(e => e._1).map(e => (e._1, e._2(0)._2))
country2capitalMap: scala.collection.Map[java.lang.String,java.lang.String] = Map(England -> London, Germany -> Berlin)
But here is the best solution:
scala> val betterConversion = Map(country2capitalList:_*)
betterConversion: scala.collection.immutable.Map[java.lang.String,java.lang.String] = Map(England -> London, Germany -> Berlin)
The :_* is needed to give the compiler a hint to use the list as a varargs argument. Otherwise it will give you:
scala> Map(country2capitalList)
<console>:6: error: type mismatch;
found : List[(java.lang.String, java.lang.String)]
required: (?, ?)
Map(country2capitalList)
^
In 2.8, you can use the toMap method:
scala> val someList = List((1, "one"), (2, "two"))
someList: List[(Int, java.lang.String)] = List((1,one), (2,two))
scala> someList.toMap
res0: scala.collection.immutable.Map[Int,java.lang.String] = Map((1,one), (2,two))
This will work for any collection of pairs. Note that the documentation has this to say about its duplicate policy:
Duplicate keys will be overwritten by later keys: if this is an unordered collection, which key is in the resulting map is undefined.
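As a small sketch of both points, the for-comprehension from the question can be kept if its result is wrapped in parentheses and converted with toMap, and a list with a repeated key shows the duplicate policy. The name toCapitalMap is hypothetical.

```scala
object ToMapSketch {
  // Wrapping the for-comprehension in parentheses lets toMap apply to its result.
  def toCapitalMap(pairs: List[(String, String)]): Map[String, String] =
    (for ((country, capital) <- pairs) yield country -> capital).toMap

  def main(args: Array[String]): Unit = {
    println(toCapitalMap(List("England" -> "London", "Germany" -> "Berlin")))
    // For an ordered collection such as List, the later duplicate key wins.
    println(List(1 -> "one", 1 -> "uno").toMap)
  }
}
```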
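In Scala 2.8 (note that scala.collection.breakOut was removed in Scala 2.13, where the toMap variant still works):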
scala> import scala.collection.breakOut
import scala.collection.breakOut
scala> val ls = List("a","bb","ccc")
ls: List[java.lang.String] = List(a, bb, ccc)
scala> val map: Map[String,Int] = ls.map{ s => (s,s.length) }(breakOut)
map: Map[String,Int] = Map((a,1), (bb,2), (ccc,3))
scala> val map2: Map[String,Int] = ls.map{ s => (s,s.length) }.toMap
map2: Map[String,Int] = Map((a,1), (bb,2), (ccc,3))