Given the following list of tuples...
val list = List((1, 2), (1, 2), (1, 2))
... how do I sum all the values and obtain a single tuple like this?
(3, 6)
Using the foldLeft method. Please look at the scaladoc for more information.
scala> val list = List((1, 2), (1, 2), (1, 2))
list: List[(Int, Int)] = List((1,2), (1,2), (1,2))
scala> list.foldLeft((0, 0)) { case ((accA, accB), (a, b)) => (accA + a, accB + b) }
res0: (Int, Int) = (3,6)
Using unzip. Not as efficient as the above solution. Perhaps more readable.
scala> list.unzip match { case (l1, l2) => (l1.sum, l2.sum) }
res1: (Int, Int) = (3,6)
Very easy: (list.map(_._1).sum, list.map(_._2).sum).
You can solve this using Monoid.combineAll from the cats library:
import cats.instances.int._ // For monoid instances for `Int`
import cats.instances.tuple._ // for Monoid instance for `Tuple2`
import cats.Monoid.combineAll
def main(args: Array[String]): Unit = {
  val list = List((1, 2), (1, 2), (1, 2))
  val res = combineAll(list)
  println(res)
  // Displays
  // (3,6)
}
You can see more about this in the cats documentation or Scala with Cats.
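To show what is going on behind combineAll without pulling in the library, here is a minimal sketch of the same idea. SimpleMonoid and combineAllSketch are hypothetical names made up for this illustration; they are not the cats API, which is richer than this.

```scala
object MonoidSketch {
  // Hypothetical minimal type class: an empty value plus an associative combine.
  trait SimpleMonoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  object SimpleMonoid {
    // Int forms a monoid under addition with 0 as the empty value.
    implicit val intAddition: SimpleMonoid[Int] = new SimpleMonoid[Int] {
      def empty: Int = 0
      def combine(x: Int, y: Int): Int = x + y
    }

    // A pair is a monoid whenever both of its components are.
    implicit def pairMonoid[A, B](implicit
        ma: SimpleMonoid[A],
        mb: SimpleMonoid[B]
    ): SimpleMonoid[(A, B)] =
      new SimpleMonoid[(A, B)] {
        def empty: (A, B) = (ma.empty, mb.empty)
        def combine(x: (A, B), y: (A, B)): (A, B) =
          (ma.combine(x._1, y._1), mb.combine(x._2, y._2))
      }
  }

  // Fold the list, starting from the monoid's empty value.
  def combineAllSketch[A](as: List[A])(implicit m: SimpleMonoid[A]): A =
    as.foldLeft(m.empty)(m.combine)
}
```

With these definitions, MonoidSketch.combineAllSketch(List((1, 2), (1, 2), (1, 2))) gives (3,6), and an empty list gives (0,0).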
I am answering this question while trying to understand the aggregate function in Spark.
scala> val list = List((1, 2), (1, 2), (1, 2))
list: List[(Int, Int)] = List((1,2), (1,2), (1,2))
scala> list.aggregate((0,0))((x,y)=>(x._1+y._1,x._2+y._2),(x,y)=>(x._1+y._1,x._2+y._2))
res89: (Int, Int) = (3,6)
Here is the SO Q&A that helped me understand and answer this: Explain the aggregate functionality in Spark
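On a plain List everything runs sequentially, so as far as I can tell the second function passed to aggregate is never really exercised. In Spark the first function folds the elements inside each partition and the second one merges the per-partition results. The sketch below imitates that in plain Scala; aggregatePartitions is a hypothetical helper written for this illustration, and the partitions are split by hand.

```scala
object AggregateSketch {
  // Hypothetical helper: emulates a Spark-style aggregate over pre-split partitions.
  def aggregatePartitions[A, B](partitions: List[List[A]], zero: B)(
      seqOp: (B, A) => B,
      combOp: (B, B) => B
  ): B =
    partitions
      .map(_.foldLeft(zero)(seqOp)) // one partial result per partition
      .foldLeft(zero)(combOp)       // merge the partial results

  // The same three tuples, split into two pretend partitions.
  val partitions: List[List[(Int, Int)]] =
    List(List((1, 2), (1, 2)), List((1, 2)))

  val total: (Int, Int) =
    aggregatePartitions(partitions, (0, 0))(
      (acc, x) => (acc._1 + x._1, acc._2 + x._2), // add one element to a partial sum
      (l, r) => (l._1 + r._1, l._2 + r._2)        // add two partial sums
    )
}
```

Here total is (3,6). Once there is more than one partition, a mistake in the second function changes the result, which is why both functions have to add matching fields.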
Scalaz solution (suggested by Travis in an answer that was deleted for some reason):
import scalaz._
import Scalaz._
val list = List((1, 2), (1, 2), (1, 2))
list.suml
which outputs
res0: (Int, Int) = (3,6)
You can also use the reduce function:
val list = List((1, 2), (1, 2), (1, 2))
val res = list.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
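One caveat: reduce throws an exception on an empty list. If the list may be empty, reduceOption is a safer variant. A small sketch, where sumPairs is just an illustrative name:

```scala
object ReduceSketch {
  // Returns None for an empty list instead of throwing.
  def sumPairs(pairs: List[(Int, Int)]): Option[(Int, Int)] =
    pairs.reduceOption((x, y) => (x._1 + y._1, x._2 + y._2))
}
```

ReduceSketch.sumPairs(List((1, 2), (1, 2), (1, 2))) gives Some((3,6)), and ReduceSketch.sumPairs(Nil) gives None.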
Related
I have a spark rdd with a column like
List(1, 3, 4, 8)
List(2, 3)
List(1, 5, 6)
I would like to get a new rdd with consecutive elements in each list to rows, like
(1, 3)
(3, 4)
(4, 8)
(2, 3)
(1, 5)
(5, 6)
How can I achieve this with scala?
Consider:
using a complementary (plain Scala) function with signature List[Int] => List[(Int, Int)] to achieve the desired result for the single list
and
passing this function to your RDD's flatMap method.
This complementary function may look like this:
def makeTuples(l: List[Int],
               acc: List[(Int, Int)] = List.empty): List[(Int, Int)] =
  l match {
    case Nil | _ :: Nil => acc.reverse
    case a :: b :: rest => makeTuples(b :: rest, (a, b) :: acc)
  }
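A shorter alternative for the per-list function is to zip the list with itself shifted by one element. In the sketch below a plain List of Lists stands in for the RDD (an assumption so that it runs without Spark); on a real RDD[List[Int]] the same function would be passed to flatMap.

```scala
object ConsecutivePairs {
  // Pairs each element with its successor; empty and single-element lists give Nil.
  def pairs(l: List[Int]): List[(Int, Int)] =
    l.zip(l.drop(1))

  // Stand-in for the RDD's rows.
  val rows: List[List[Int]] = List(List(1, 3, 4, 8), List(2, 3), List(1, 5, 6))

  // flatMap applies pairs to every row and concatenates the results.
  val result: List[(Int, Int)] = rows.flatMap(pairs)
}
```

drop(1) is used instead of tail because tail throws on an empty list.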
I have an RDD Array like this in scala-spark:
Array[(String,Int)]= Array((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))
I need to group it by the first part of the key (A1, A2 or A3) so that each group becomes a list of the corresponding numbers, like this:
List( A1:(1,10), A2:(5,5), A3:(3) )
Please help.
Treating it as an RDD, we can do it the following way.
scala> val x = List(("A1:B",1),("A1:A",10),("A2:C",5),("A2:E",5),("A3:D",3))
x: List[(String, Int)] = List((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))
scala> x.map( a=> (a._1.split(":"),a._2))
res1: List[(Array[String], Int)] = List((Array(A1, B),1), (Array(A1, A),10), (Array(A2, C),5), (Array(A2, E),5), (Array(A3, D),3))
scala> res1.map(a => (a._1(0),a._2))
res12: List[(String, Int)] = List((A1,1), (A1,10), (A2,5), (A2,5), (A3,3))
scala> val rdd = sc.makeRDD(res12)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:33
scala> rdd.groupByKey()
res13: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[16] at groupByKey at <console>:36
scala> res13.collect
res14: Array[(String, Iterable[Int])] = Array((A3,CompactBuffer(3)), (A1,CompactBuffer(1, 10)), (A2,CompactBuffer(5, 5)))
You can try this:
val data = Array(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3))
val grpData = data.groupBy(f => f._1.split(":")(0)).map(x => (x._1 + ":(" + x._2.map(_._2).mkString(",") + ")")).toList
println(grpData)
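If you would rather keep the numbers as values than build strings, the same groupBy can produce a Map from the key prefix to the list of numbers. A sketch on a plain List (no Spark involved; the object and value names are only illustrative):

```scala
object GroupByPrefix {
  val data = List(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3))

  val grouped: Map[String, List[Int]] =
    data
      .groupBy { case (key, _) => key.split(":")(0) }               // group by the part before ':'
      .map { case (prefix, entries) => prefix -> entries.map(_._2) } // keep only the numbers
}
```

GroupByPrefix.grouped("A1") is List(1, 10). The iteration order of the resulting Map is not something to rely on.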
This might be a pretty simple question. I have a list named "List1" that contains integer pairs, as below.
List1 = List((1,2), (3,4), (9,8), (9,10))
Output should be:
r1 = (1,3,9,9) //List((1,2), (3,4), (9,8), (9,10))
r2 = (2,4,8,10) //List((1,2), (3,4), (9,8), (9,10))
Array r1 (Array[Int]) should contain the first integer of each pair in the list.
Array r2 (Array[Int]) should contain the second integer of each pair.
Just use unzip:
scala> List((1,2), (3,4), (9,8), (9,10)).unzip
res0: (List[Int], List[Int]) = (List(1, 3, 9, 9),List(2, 4, 8, 10))
Use foldLeft
val (alist, blist) = list1.foldLeft((List.empty[Int], List.empty[Int])) { (r, c) => (r._1 ++ List(c._1), r._2 ++ List(c._2))}
Scala REPL
scala> val list1 = List((1, 2), (3, 4), (5, 6))
list1: List[(Int, Int)] = List((1,2), (3,4), (5,6))
scala> val (alist, blist) = list1.foldLeft((List.empty[Int], List.empty[Int])) { (r, c) => (r._1 ++ List(c._1), r._2 ++ List(c._2))}
alist: List[Int] = List(1, 3, 5)
blist: List[Int] = List(2, 4, 6)
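Note that appending with ++ List(x) copies the accumulated list on every step, so the fold above is quadratic in the length of the input. A sketch of a variant that prepends with :: inside foldRight and keeps the original order (unzipPairs is an illustrative name):

```scala
object UnzipSketch {
  // foldRight visits elements from the right, so prepending preserves the order.
  def unzipPairs(pairs: List[(Int, Int)]): (List[Int], List[Int]) =
    pairs.foldRight((List.empty[Int], List.empty[Int])) {
      case ((a, b), (as, bs)) => (a :: as, b :: bs)
    }
}
```

For everyday use the built-in unzip is still the simplest choice.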
I am new to Spark programming and Scala, and I am not able to understand the difference between map and flatMap.
I tried the code below expecting both calls to work, but I got an error.
scala> val b = List("1","2", "4", "5")
b: List[String] = List(1, 2, 4, 5)
scala> b.map(x => (x,1))
res2: List[(String, Int)] = List((1,1), (2,1), (4,1), (5,1))
scala> b.flatMap(x => (x,1))
<console>:28: error: type mismatch;
found : (String, Int)
required: scala.collection.GenTraversableOnce[?]
b.flatMap(x => (x,1))
As I understand it, flatMap turns an RDD into a collection for a String/Int RDD.
I thought both calls should work without any error in this case. Please let me know where I am making the mistake.
Thanks
You need to look at how the signatures of these methods are defined:
def map[U: ClassTag](f: T => U): RDD[U]
map takes a function from type T to type U and returns an RDD[U].
On the other hand, flatMap:
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
It expects a function from type T to a TraversableOnce[U], which is a trait Tuple2 doesn't implement, and returns an RDD[U]. Generally, you use flatMap when you want to flatten a collection of collections, i.e. if you had an RDD[List[List[Int]]] and you wanted to produce an RDD[List[Int]], you could flatMap it using identity.
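A small sketch of both points, using a plain List as a stand-in for the RDD (an assumption so that it runs without Spark): flattening one level with identity, and making the expression from the question compile by wrapping the tuple in a collection.

```scala
object FlatMapSketch {
  val nested: List[List[List[Int]]] =
    List(List(List(1, 2), List(3)), List(List(4, 5)))

  // flatMap with identity removes exactly one level of nesting.
  val flattened: List[List[Int]] = nested.flatMap(identity)

  // A tuple is not a collection, so wrap it in a List before flatMap sees it.
  val b = List("1", "2", "4", "5")
  val pairs: List[(String, Int)] = b.flatMap(x => List((x, 1)))
}
```

In this particular case pairs ends up equal to b.map(x => (x, 1)), so map is the more natural choice.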
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
The following example might be helpful.
scala> val b = List("1", "2", "4", "5")
b: List[String] = List(1, 2, 4, 5)
scala> b.map(x=>Set(x,1))
res69: List[scala.collection.immutable.Set[Any]] =
List(Set(1, 1), Set(2, 1), Set(4, 1), Set(5, 1))
scala> b.flatMap(x=>Set(x,1))
res70: List[Any] = List(1, 1, 2, 1, 4, 1, 5, 1)
scala> b.flatMap(x=>List(x,1))
res71: List[Any] = List(1, 1, 2, 1, 4, 1, 5, 1)
scala> b.flatMap(x=>List(x+1))
res75: List[String] = List(11, 21, 41, 51) // concat
scala> val x = sc.parallelize(List("aa bb cc dd", "ee ff gg hh"), 2)
scala> val y = x.map(x => x.split(" ")) // split(" ") returns an array of words
scala> y.collect
res0: Array[Array[String]] = Array(Array(aa, bb, cc, dd), Array(ee, ff, gg, hh))
scala> val y = x.flatMap(x => x.split(" "))
scala> y.collect
res1: Array[String] = Array(aa, bb, cc, dd, ee, ff, gg, hh)
The function passed to map returns a single value of type U, whereas the function passed to flatMap returns a TraversableOnce[U] (that is, a collection), which flatMap then flattens.
val b = List("1", "2", "4", "5")
val mapRDD = b.map { input => (input, 1) }
mapRDD.foreach(f => println(f._1 + " " + f._2))
val flatmapRDD = b.flatMap { input => List((input, 1)) }
flatmapRDD.foreach(f => println(f._1 + " " + f._2))
map does a 1-to-1 transformation, while flatMap converts a list of lists to a single list:
scala> val b = List(List(1,2,3), List(4,5,6), List(7,8,90))
b: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 90))
scala> b.map(x => (x,1))
res1: List[(List[Int], Int)] = List((List(1, 2, 3),1), (List(4, 5, 6),1), (List(7, 8, 90),1))
scala> b.flatMap(x => x)
res2: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 90)
Also, flatMap is useful for filtering out None values if you have a list of Options:
scala> val c = List(Some(1), Some(2), None, Some(3), Some(4), None)
c: List[Option[Int]] = List(Some(1), Some(2), None, Some(3), Some(4), None)
scala> c.flatMap(x => x)
res3: List[Int] = List(1, 2, 3, 4)
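For this Option case, flatMap(x => x) is the same as calling flatten directly. A quick sketch comparing the two:

```scala
object OptionFlatten {
  val c: List[Option[Int]] = List(Some(1), Some(2), None, Some(3), Some(4), None)

  // Both drop the None entries and unwrap the Some values.
  val viaFlatMap: List[Int] = c.flatMap(x => x)
  val viaFlatten: List[Int] = c.flatten
}
```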
I wish to add a list of tuples of integers i.e. given an input list of tuples of arity k, produce a tuple of arity k whose fields are sums of corresponding fields of the tuples in the list.
Input
List( (1,2,3), (2,3,-3), (1,1,1))
Output
(4, 6, 1)
I was trying to use foldLeft, but I am not able to get it to compile. Right now, I am using a for loop, but I was looking for a more concise solution.
This can be done type safely and very concisely using shapeless:
scala> import shapeless._, syntax.std.tuple._
import shapeless._
import syntax.std.tuple._
scala> val l = List((1, 2, 3), (2, 3, -1), (1, 1, 1))
l: List[(Int, Int, Int)] = List((1,2,3), (2,3,-1), (1,1,1))
scala> l.map(_.toList).transpose.map(_.sum)
res0: List[Int] = List(4, 6, 3)
Notice that unlike solutions which rely on casts, this approach is type safe, and any type errors are detected at compile time rather than at runtime:
scala> val l = List((1, 2, 3), (2, "foo", -1), (1, 1, 1))
l: List[(Int, Any, Int)] = List((1,2,3), (2,foo,-1), (1,1,1))
scala> l.map(_.toList).transpose.map(_.sum)
<console>:15: error: could not find implicit value for parameter num: Numeric[Any]
l.map(_.toList).transpose.map(_.sum)
^
scala> val tuples = List( (1,2,3), (2,3,-3), (1,1,1))
tuples: List[(Int, Int, Int)] = List((1,2,3), (2,3,-3), (1,1,1))
scala> tuples.map(t => t.productIterator.toList.map(_.asInstanceOf[Int])).transpose.map(_.sum)
res0: List[Int] = List(4, 6, 1)
Type information is lost when calling productIterator on Tuple3 so you have to convert from Any back to an Int.
If the tuples are always going to contain the same type I would suggest using another collection such as List. The Tuple is better suited for disparate types. When you have the same types and don't lose the type information by using productIterator the solution is more elegant.
scala> val tuples = List(List(1,2,3), List(2,3,-3), List(1,1,1))
tuples: List[List[Int]] = List(List(1, 2, 3), List(2, 3, -3), List(1, 1, 1))
scala> tuples.transpose.map(_.sum)
res1: List[Int] = List(4, 6, 1)
scala> val list = List( (1,2,3), (2,3,-3), (1,1,1))
list: List[(Int, Int, Int)] = List((1,2,3), (2,3,-3), (1,1,1))
scala> list.foldRight( (0, 0, 0) ){ case ((a, b, c), (a1, b1, c1)) => (a + a1, b + b1, c + c1) }
res0: (Int, Int, Int) = (4,6,1)
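Since the question mentions foldLeft not compiling, here is the foldLeft counterpart as a sketch. The failing code isn't shown, so this is only a guess at the stumbling block: a plain two-argument lambda cannot destructure the tuples, whereas a case pattern can. Note that the accumulator comes first with foldLeft, the opposite of foldRight.

```scala
object TripleSum {
  // The accumulator is the first pattern, the current element the second.
  def sumTriples(ts: List[(Int, Int, Int)]): (Int, Int, Int) =
    ts.foldLeft((0, 0, 0)) {
      case ((accA, accB, accC), (a, b, c)) => (accA + a, accB + b, accC + c)
    }
}
```

TripleSum.sumTriples(List((1, 2, 3), (2, 3, -3), (1, 1, 1))) gives (4,6,1). This only covers arity 3; for a general arity k the transpose-based answers above are a better fit.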