Scala Aggregate Projection

How can I project / transform the following
List(("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", 2.0))
to
List(("A", 4.0),("B", 4.0))
So I aggregate by the string and sum the doubles?

val x = List(("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", 2.0))
val y = x.groupBy(_._1).map { case (a,bs) => a -> bs.map(_._2).sum }
y.toList // List((A,4.0), (B,4.0))
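If you are on Scala 2.13 or later, groupMapReduce does the grouping, projection, and summing in one pass; a minimal sketch using the same x:
// group by the String key, map each pair to its Double, and reduce each group with +
x.groupMapReduce(_._1)(_._2)(_ + _).toList
// List((A,4.0), (B,4.0))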

Related

Getting first n distinct Key Tuples in Scala Spark

I have an RDD of tuples as follows:
(a, 1), (a, 2), (b, 1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b, 1) (the keys are distinct). The values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
.reduceByKey((v1, _) => v1) // combine values; keep one per key (the values are irrelevant here)
.collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
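If you need the first n tuples with distinct keys for an arbitrary n rather than all of them, the same reduceByKey trick generalizes; a sketch, where n = 2 is only illustrative and ordering is not guaranteed:
val n = 2
val firstN = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .reduceByKey((v1, _) => v1) // one value per key
  .take(n)                    // Array of n tuples with distinct keys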
As you already have an RDD of pairs, your RDD gets the extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or in short
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
.groupByKey()
.map { case (key, iter) => (key, iter.head) }
It is easy; you just need to use collectAsMap, like below:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)

How to define Tuple1 in Scala?

I tried to use (1,), but it doesn't work. What's the syntax to define a Tuple1 in Scala?
scala> val a=(1,)
<console>:1: error: illegal start of simple expression
val a=(1,)
For tuples with cardinality 2 or more you can use parentheses; however, for cardinality 1 you need to use Tuple1:
scala> val tuple1 = Tuple1(1)
tuple1: Tuple1[Int] = (1,)
scala> val tuple2 = ('a', 1)
tuple2: (Char, Int) = (a,1)
scala> val tuple3 = ('a', 1, "name")
tuple3: (Char, Int, java.lang.String) = (a,1,name)
scala> tuple1._1
res0: Int = 1
scala> tuple2._2
res1: Int = 1
scala> tuple3._1
res2: Char = a
scala> tuple3._3
res3: String = name
To declare the type, use Tuple1[T], for example val t : Tuple1[Int] = Tuple1(22)
A tuple is, by definition, an ordered list of elements. While Tuple1 exists, I haven't seen it used explicitly, since you'd normally just use the single element itself. Nevertheless, there is no syntactic sugar for it; you need to write Tuple1(1).
There is a valid use case in Spark that requires Tuple1: creating a DataFrame with a single column.
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
data.toDF("features").show()
It will throw an error:
"value toDF is not a member of Seq[org.apache.spark.ml.linalg.Vector]"
To make it work, we have to convert each row to Tuple1:
val data = Seq(
  Tuple1(Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))),
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
  Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
)
or a better way:
val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
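In both snippets, toDF assumes the SparkSession implicits are in scope; spark-shell imports them automatically, but in a standalone program you need something like the following (where spark is the assumed name of your SparkSession):
// spark is your SparkSession instance
import spark.implicits._
data.toDF("features").show()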

Scala - groupBy map values to list

I have the following input data
((A, 1, 4), (A, 2, 5), (A, 3, 6))
I would like to produce the following result
(A, (1, 2, 3), (4, 5, 6))
by grouping the input by key.
What would be the correct way to do this in Scala?
((A, 1, 4), (A, 2, 5), (A, 3, 6))
If this represents a List[(String, Int, Int)], then try the following.
val l = List(("A", 1, 4), ("A", 2, 5), ("A", 3, 6), ("B", 1, 4), ("B", 2, 5), ("B", 3, 6))
l.groupBy(_._1).map {
  case (k, v) => (k, v.map { case (_, v1, v2) => (v1, v2) }.unzip)
}
This will result in a Map[String,(List[Int], List[Int])], i.e., a map with string keys mapped to tuples of two lists.
Map(A -> (List(1, 2, 3), List(4, 5, 6)), B -> (List(1, 2, 3), List(4, 5, 6)))
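If you are on Scala 2.13 or later, groupMap expresses the same transformation a bit more directly; a sketch against the same l as above:
// group by the key and map each element to its (v1, v2) pair in one pass,
// then unzip each group into the two lists
l.groupMap(_._1) { case (_, v1, v2) => (v1, v2) }
  .view.mapValues(_.unzip)
  .toMap
// Map(A -> (List(1, 2, 3),List(4, 5, 6)), B -> (List(1, 2, 3),List(4, 5, 6)))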
Try something like this:
def takeHeads[A](lists: List[List[A]]): (List[A], List[List[A]]) =
  (lists.map(_.head), lists.map(_.tail))

def translate[A](lists: List[List[A]]): List[List[A]] =
  if (lists.flatten.isEmpty) Nil
  else {
    val t = takeHeads(lists)
    t._1 :: translate(t._2)
  }

yourValue.groupBy(_.head).mapValues(v => translate(v.map(_.tail)))
This produces a Map[Any,Any] when used on your value... but it should get you going in the right direction.
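For completeness, a hedged usage sketch, assuming the input is supplied as lists rather than tuples (yourValue is an assumed name):
// group by the first element, strip it off, then transpose the remaining columns
val yourValue = List(List("A", 1, 4), List("A", 2, 5), List("A", 3, 6))
yourValue.groupBy(_.head).mapValues(v => translate(v.map(_.tail)))
// roughly: Map(A -> List(List(1, 2, 3), List(4, 5, 6)))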

How to filter an RDD according to a function based on another RDD in Spark?

I am a beginner with Apache Spark. I want to filter an RDD, keeping only the groups whose sum of weights is larger than a constant value. The "weight" map is also an RDD. Here is a small demo; the groups to be filtered are stored in "groups", and the constant value is 12:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
val wm = weights.toArray.toMap
def isheavy(inp: String): Boolean = {
val allw = inp.split(",").map(wm(_)).sum
allw > 12
}
val result = groups.filter(isheavy)
When the input data is very large, e.g. > 10GB, I always encounter a "java heap out of memory" error. I suspected it was caused by "weights.toArray.toMap", because it converts a distributed RDD into a Java object in the JVM. So I tried to filter with the RDD directly:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
def isheavy(inp: String): Boolean = {
val items = inp.split(",")
val wm = items.map(x => weights.filter(_._1 == x).first._2)
wm.sum > 12
}
val result = groups.filter(isheavy)
When I ran result.collect after loading this script into the Spark shell, I got a "java.lang.NullPointerException" error. Someone told me that when an RDD is manipulated inside another RDD's transformation there will be a NullPointerException, and suggested putting the weights into Redis.
So how can I get "result" without converting "weights" to a Map, or putting it into Redis? Is there a solution for filtering an RDD based on another map-like RDD without the help of an external datastore service?
Thanks!
Suppose your groups are unique; otherwise, first make them unique with distinct, etc.
If either groups or weights is small, this should be easy. If both groups and weights are huge, you can try the following, which may be more scalable but also looks complicated.
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
// map groups to (a, (a,b,c,d)), (b, (a,b,c,d)), (c, (a,b,c,d)) ...
val g1 = groups.flatMap(s => s.split(",").map(x => (x, Seq(s))))
// j will be (a, ((a,b,c,d), 3)) ...
val j = g1.join(weights)
// k will be ((a,b,c,d), 3), ((a,b,c,d), 2) ...
val k = j.map(x => (x._2._1, x._2._2))
// l will be ((a,b,c,d), (3, 2, 5, 1)) ...
val l = k.groupByKey()
// filter by summing the weights of each group
val m = l.filter(x => { var sum = 0; x._2.foreach(a => sum = sum + a); sum > 12 })
// we only need the original group
val result = m.map(x => x._1)
// don't do this in a real product, otherwise all results go to the driver; use saveAsTextFile, etc., instead
scala> result.foreach(println)
List(e,g)
List(b,c,e)
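A hedged alternative not taken in the original answers: if the weights table is small enough to fit in each executor's memory, broadcasting it avoids both the nested-RDD NullPointerException and the join machinery above.
// broadcast the lookup table once; every task reads broadcastWeights.value locally
val broadcastWeights = sc.broadcast(weights.collectAsMap())
val heavyGroups = groups.filter { s =>
  s.split(",").map(broadcastWeights.value(_)).sum > 12
}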
The "java out of memory" error is coming because spark uses its spark.default.parallelism property while determining number of splits, which by default is number of cores available.
// From CoarseGrainedSchedulerBackend.scala
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
When the input becomes large and you have limited memory, you should increase the number of splits.
You can do something as follows:
val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g")
val splitSize = 10000 // specify some number of elements that fit in memory.
val numSplits = (input.size / splitSize) + 1 // has to be > 0.
val groups = sc.parallelize(input, numSplits) // specify the # of splits.
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val result = groups.filter(isHeavy)
You may also consider increasing executor memory size using spark.executor.memory.
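For example, a sketch of setting both properties when building the context; the values and app name are illustrative, not recommendations:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("heavy-groups")              // illustrative name
  .set("spark.executor.memory", "4g")      // illustrative value
  .set("spark.default.parallelism", "64")  // illustrative value
val sc = new SparkContext(conf)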

Sequence transformations in Scala

In Scala, is there a simple way of transforming this sequence
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
into this Seq(("a", 4), ("b", 7), ("c", 4))?
Thanks
I'm not sure if you meant to have Strings in the second component of the tuple. Assuming Seq[(String, Int)], you can use groupBy to group the elements by the first component:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
.groupBy(_._1)
.mapValues(_.map(_._2).sum)
.toSeq
Otherwise, you'll need an extra .toInt.
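Note that on Scala 2.13+ mapValues on a Map is deprecated in favour of the view-based form; a sketch of the same pipeline:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
  .groupBy(_._1)
  .view.mapValues(_.map(_._2).sum)
  .toSeq
// Seq((a,4), (b,7), (c,4)) in some order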
Here is another way, by unzipping and summing the second item in each tuple.
val sq = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
sq.groupBy(_._1)
.transform {(k,lt) => lt.unzip._2.sum}
.toSeq
The above code in detail:
scala> sq.groupBy(_._1)
res01: scala.collection.immutable.Map[String,Seq[(String, Int)]] = Map(b -> List((b,2), (b,5)), a -> List((a,1), (a,3)), c -> List((c,4)))
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}
res02: scala.collection.immutable.Map[String,Int] = Map(b -> 7, a -> 4, c -> 4)
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}.toSeq
res03: Seq[(String, Int)] = ArrayBuffer((b,7), (a,4), (c,4))