Removing parentheses after joining RDDs - scala

I am joining a large number of RDDs and I was wondering whether there is a generic way of removing the nested parentheses that each join creates.
Here is a small sample:
val rdd1 = sc.parallelize(Array((1,2),(2,4),(3,6)))
val rdd2 = sc.parallelize(Array((1,7),(2,8),(3,6)))
val rdd3 = sc.parallelize(Array((1,2),(2,4),(3,6)))
val result = rdd1.join(rdd2).join(rdd3)
// result: org.apache.spark.rdd.RDD[(Int, ((Int, Int), Int))]
result.collect
// Array[(Int, ((Int, Int), Int))] = Array((1,((2,7),2)), (2,((4,8),4)), (3,((6,6),6)))
I know I can use map
result.map((x) => (x._1,(x._2._1._1,x._2._1._2,x._2._2))).collect
Array[(Int, (Int, Int, Int))] = Array((1,(2,7,2)), (2,(4,8,4)), (3,(6,6,6)))
but with a large number of RDDs, each containing many elements, this method quickly becomes difficult to use.

With a large number of RDDs, each containing many elements, this approach simply won't work, because the largest built-in tuple is still Tuple22. If you join homogeneous RDDs, you can use some type of sequence instead:
import org.apache.spark.rdd.RDD
// Join two keyed RDDs and concatenate their value sequences.
def joinAndMerge(rdd1: RDD[(Int, Seq[Int])], rdd2: RDD[(Int, Seq[Int])]) =
  rdd1.join(rdd2).mapValues { case (x, y) => x ++ y }
// Wrap every value in a one-element Seq, then fold the joins pairwise.
Seq(rdd1, rdd2, rdd3).map(_.mapValues(Seq(_))).reduce(joinAndMerge)
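To see what the fold over joins produces without starting Spark, here is a minimal local sketch that uses plain Scala sequences in place of RDDs. The object name JoinMergeSketch and the Keyed alias are made up for illustration, and the for-comprehension only imitates an inner join; it is not how Spark implements join.

```scala
object JoinMergeSketch {
  // Stand-in for RDD[(Int, Seq[Int])]; an assumption made for local testing only.
  type Keyed = Seq[(Int, Seq[Int])]

  // Inner join on the key, concatenating the value sequences of each matching pair.
  def joinAndMerge(left: Keyed, right: Keyed): Keyed =
    for {
      (k1, xs) <- left
      (k2, ys) <- right
      if k1 == k2
    } yield (k1, xs ++ ys)

  val rdd1 = Seq((1, 2), (2, 4), (3, 6))
  val rdd2 = Seq((1, 7), (2, 8), (3, 6))
  val rdd3 = Seq((1, 2), (2, 4), (3, 6))

  // Wrap each value in a one-element Seq, then fold the joins pairwise.
  val result: Keyed =
    Seq(rdd1, rdd2, rdd3)
      .map(_.map { case (k, v) => (k, Seq(v)) })
      .reduce(joinAndMerge)
}
```

Because every value is a Seq, the result stays flat no matter how many inputs are folded in, so the Tuple22 limit never comes into play.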
If you have only three RDDs it can be cleaner to use cogroup:
rdd1.cogroup(rdd2, rdd3)
  .flatMapValues { case (xs, ys, zs) =>
    for { x <- xs; y <- ys; z <- zs } yield (x, y, z)
  }
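The cogroup variant boils down to: collect the values of each input per key, then emit every combination. A rough local sketch of that idea on plain sequences follows; CogroupSketch, Pairs, valuesFor and joinThree are hypothetical names, and sorting the keys is only there to make the output order predictable.

```scala
object CogroupSketch {
  // Stand-in for RDD[(Int, Int)]; an assumption made for local testing only.
  type Pairs = Seq[(Int, Int)]

  // All values stored under key k in one input.
  private def valuesFor(xs: Pairs, k: Int): Seq[Int] =
    xs.filter(_._1 == k).map(_._2)

  // For every key, emit each (x, y, z) combination of values from the three inputs.
  // A key that is missing from any input contributes nothing, as with an inner join.
  def joinThree(a: Pairs, b: Pairs, c: Pairs): Seq[(Int, (Int, Int, Int))] = {
    val keys = (a ++ b ++ c).map(_._1).distinct.sorted
    for {
      k <- keys
      x <- valuesFor(a, k)
      y <- valuesFor(b, k)
      z <- valuesFor(c, k)
    } yield (k, (x, y, z))
  }
}
```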
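If values are heterogeneous it makes more sense to use DataFrames:
import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in the Spark 2+ shell
def joinByKey(df1: DataFrame, df2: DataFrame) = df1.join(df2, Seq("k"))
Seq(rdd1, rdd2, rdd3).map(_.toDF("k", "v")).reduce(joinByKey)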

Related

Getting the mode from an RDD

I would like to get the mode (the most common number) from an rdd using Spark + Scala.
I can get it with the code below, but I think there could be a better way to calculate it. Most importantly, if more than one value has the same maximum number of repetitions, I need to return all of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map (e => (e, 1)).groupBy(_._1).map(e=> (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
but is there a better way to do this? I mean in terms of performance, because the original RDD would be much bigger than this.
This should work and be a little bit more efficient
(only if you are sure the number of distinct values is small, because countByValue returns all the counts to the driver as a local Map):
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = counted.collect { case (k, v) if v == max => k }
If there could be many distinct elements, consider this alternative, which is memory safe:
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max()
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
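Both variants follow the same three steps: count each value, find the highest count, and keep every value that reaches it. A small local sketch of those steps on a plain Seq, handy for checking the tie behaviour before running on a cluster (ModeSketch and modes are made-up names, not part of Spark):

```scala
object ModeSketch {
  // Returns every value that occurs the maximum number of times (all ties included).
  def modes[A](xs: Seq[A]): Set[A] = {
    if (xs.isEmpty) Set.empty[A]
    else {
      // Step 1: value -> number of occurrences.
      val counted: Map[A, Int] = xs.groupBy(identity).map { case (k, v) => (k, v.size) }
      // Step 2: the highest count.
      val maxCount = counted.values.max
      // Step 3: keep the values whose count equals the maximum.
      counted.collect { case (k, c) if c == maxCount => k }.toSet
    }
  }
}
```

For the list in the question, ModeSketch.modes(List(3,4,4,3,3,7,7,7,9)) gives Set(3, 7).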
How about getting the max key-value pair from a double groupBy? This works even better for bigger data sizes.
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the elements:
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max._2.map(_._1)
// res4: Iterable[Int] = List(3, 7)
The first groupBy followed by mapValues(_.size) turns each element into an (element, count) pair; the second groupBy groups these pairs by count, giving (count, Iterable((element, count))); max then simply returns the key-value pair with the largest key, which is the highest count.

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line, and rdd2 with the index and the third element of each row. I am new to Scala; can you please help me with how to do this?
This does not work, since y is not a Scala Vector but an org.apache.spark.mllib.linalg.Vector, which has no take method:
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern match, and then call take/drop on the Array, re-wrapping the result in a Vector:
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see, this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as an RDD[(Long, Array[Double])] in the first place:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2; otherwise the file will be loaded and parsed twice.
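If you want to check the take/drop split without a cluster, the same steps can be tried on a plain Seq of lines. This is only a local sketch: SplitColumnsSketch and its field names are invented here, the three sample lines are copied from the question, and the values are kept as Seq[Double] instead of being wrapped in mllib vectors.

```scala
object SplitColumnsSketch {
  // First three lines of the sample file from the question.
  val lines = Seq("61,139,75", "63,140,77", "64,129,82")

  // Parse each line and pair it with its index (index first, like indexedData).
  val indexed: Seq[(Long, Array[Double])] =
    lines
      .map(_.split(',').map(_.toDouble))
      .zipWithIndex
      .map { case (arr, i) => (i.toLong, arr) }

  // Index plus the first two columns.
  val firstTwo: Seq[(Long, Seq[Double])] =
    indexed.map { case (i, arr) => (i, arr.take(2).toSeq) }

  // Index plus whatever remains after the first two columns (here: the third column).
  val rest: Seq[(Long, Seq[Double])] =
    indexed.map { case (i, arr) => (i, arr.drop(2).toSeq) }
}
```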
You can achieve the above output by following the below steps:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RDD1 Data:
Having index along with first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RDD2 Data:
Having index along with third element of row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

Create a SparseVector from the elements of RDD

Using Spark, I have an RDD of type RDD[((Int, Int), Double)] in Scala, i.e. ((x, y), cov), where each element of the RDD represents an element of a matrix, with x the row, y the column and cov the value of the element.
I need to create SparseVectors from the rows of this matrix. So I decided to first convert the RDD to RDD[(Int, (Int, Double))], i.e. (x, (y, cov)), and then use groupByKey to put all elements of a specific row together, like this:
val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()
Now I need to create the SparseVectors:
val N = 7 //Vector Size
val spvec = {(x: Int,y: Iterable[(Int, Double)]) => new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))}
val vecs = rdd2.map(spvec)
However, this is the error that pops up.
type mismatch; found :Iterable[Int] required:Int
type mismatch; found :Iterable[Double] required:Double
I am guessing that y.map(el => el._1.toInt) returns an Iterable, which Array(...) cannot be applied to here. I would appreciate it if someone could help with how to do this.
The simplest solution is to convert to RowMatrix:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val rdd: RDD[((Int, Int), Double)] = ???
val vs: RDD[org.apache.spark.mllib.linalg.SparseVector] = new CoordinateMatrix(
  rdd.map {
    case ((x, y), cov) => MatrixEntry(x, y, cov)
  }
).toRowMatrix.rows.map(_.toSparse)
If you want to preserve row indices you can use toIndexedRowMatrix instead:
import org.apache.spark.mllib.linalg.distributed.IndexedRow
new CoordinateMatrix(
  rdd.map {
    case ((x, y), cov) => MatrixEntry(x, y, cov)
  }
).toIndexedRowMatrix.rows.map { case IndexedRow(i, vs) => (i, vs.toSparse) }
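As for the type mismatch in the question itself: Array(y.map(...)) builds a one-element array whose single element is the whole Iterable, which is why the compiler reports Iterable[Int] where Int is required. Calling .toArray on the mapped collection gives the flat arrays instead. The sketch below shows only that step in plain Scala; SparseRowSketch and toSparseArrays are hypothetical names. It also sorts by column index, since the mllib documentation describes the index array of a sparse vector as strictly increasing.

```scala
object SparseRowSketch {
  // Turns the grouped (column, value) pairs of one row into the two parallel
  // arrays a sparse vector is built from, ordered by column index.
  def toSparseArrays(row: Iterable[(Int, Double)]): (Array[Int], Array[Double]) = {
    val sorted = row.toSeq.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

With Spark on the classpath, the two arrays could then be passed to Vectors.sparse(N, indices, values), where N is the vector size from the question.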

RDD split and do aggregation on new RDDs

I have an RDD of (String,String,Int).
1. I want to reduce it based on the first two strings.
2. Then, based on the first String, I want to group the (String,Int) pairs and sort them.
3. After sorting I need to split them into small groups, each containing n elements.
I have written the code below. The problem is that the number of elements in step 2 is very large for a single key, and the reduceByKey(x ++ y) takes a lot of time.
import scala.collection.mutable.ArrayBuffer
// Input
val data = Array(
("c1","a1",1), ("c1","b1",1), ("c2","a1",1),("c1","a2",1), ("c1","b2",1),
("c2","a2",1), ("c1","a1",1), ("c1","b1",1), ("c2","a1",1))
val rdd = sc.parallelize(data)
val r1 = rdd.map(x => ((x._1, x._2), (x._3)))
val r2 = r1.reduceByKey((x, y) => x + y ).map(x => ((x._1._1), (x._1._2, x._2)))
// This is taking long time.
val r3 = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
// from the list I will be doing grouping.
val r4 = r3.map(x => (x._1 , x._2.toList.sorted.grouped(2).toList))
The problem is that "c1" has a lot of unique entries (b1, b2, ... millions of them) and reduceByKey is very slow because all the values go to a single node.
Is there a way to achieve this more efficiently?
// output
Array((c1,List(List((a1,2), (a2,1)), List((b1,2), (b2,1)))), (c2,List(List((a1,2), (a2,1)))))
There are at least a few problems with the way you group your data. The first problem is introduced by
mapValues(x => ArrayBuffer(x))
It creates a large number of mutable objects which provide no additional value, since you cannot leverage their mutability in the subsequent
reduceByKey((x, y) => x ++ y)
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map-side aggregation, the situation is even worse and pretty much creates GC hell.
Is there a way to achieve this more efficiently?
Unless you have some deeper knowledge about the data distribution which can be used to define a smarter partitioner, the simplest improvement is to replace mapValues + reduceByKey with a simple groupByKey:
val r3 = r2.groupByKey
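For reference, the whole pipeline (sum per pair of keys, regroup by the first key, sort, split into chunks of two) can be checked locally on the sample data from the question. This is a plain-collections sketch, not Spark code; GroupSortChunkSketch and its field names are invented for the example.

```scala
object GroupSortChunkSketch {
  val data = Seq(
    ("c1", "a1", 1), ("c1", "b1", 1), ("c2", "a1", 1), ("c1", "a2", 1), ("c1", "b2", 1),
    ("c2", "a2", 1), ("c1", "a1", 1), ("c1", "b1", 1), ("c2", "a1", 1))

  // Step 1: sum the counts per (first, second) key.
  val summed: Map[(String, String), Int] =
    data
      .groupBy { case (c, a, _) => (c, a) }
      .map { case (key, rows) => (key, rows.map(_._3).sum) }

  // Steps 2 and 3: regroup by the first key, sort the (second, sum) pairs,
  // then split them into chunks of two.
  val chunked: Map[String, List[List[(String, Int)]]] =
    summed.toList
      .map { case ((c, a), n) => (c, (a, n)) }
      .groupBy(_._1)
      .map { case (c, pairs) => (c, pairs.map(_._2).sorted.grouped(2).toList) }
}
```

The two entries of chunked match the expected output listed in the question.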
It should also be possible to use a custom partitioner for both reduceByKey calls, and mapPartitions with preservesPartitioning instead of map:
class FirstElementPartitioner(partitions: Int)
    extends org.apache.spark.Partitioner {
  def numPartitions: Int = partitions
  // Partition on the first element of the key only; floorMod keeps the
  // result non-negative even when the hash code is negative.
  def getPartition(key: Any): Int =
    Math.floorMod(key.asInstanceOf[(Any, Any)]._1.##, numPartitions)
}
val r2 = r1
  .reduceByKey(new FirstElementPartitioner(8), (x, y) => x + y)
  .mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), true)
// No shuffle required here.
val r3 = r2.groupByKey
It requires only a single shuffle and groupByKey is simply a local operation:
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
// | MapPartitionsRDD[40] at mapPartitions at <console>:35 []
// | ShuffledRDD[39] at reduceByKey at <console>:34 []
// +-(8) MapPartitionsRDD[1] at map at <console>:28 []
// | ParallelCollectionRDD[0] at parallelize at <console>:26 []
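One detail worth checking in any custom partitioner: getPartition has to return a value between 0 and numPartitions - 1, but the % operator returns a negative result when the hash code is negative. A minimal sketch of a safe index computation, outside of Spark (PartitionIndexSketch and partitionFor are hypothetical names):

```scala
object PartitionIndexSketch {
  // Partition index derived from the first element of a pair key.
  // Math.floorMod always returns a value in [0, numPartitions) for a positive divisor.
  def partitionFor(key: (Any, Any), numPartitions: Int): Int =
    Math.floorMod(key._1.##, numPartitions)
}
```

Keys that share the first element, such as ("c1","a1") and ("c1","b2"), get the same index, which is what lets the later groupByKey run without another shuffle.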

Summing items within a Tuple

Below is a data structure of List of tuples, of type List[(String, String, Int)]
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id. So the above data structure should be converted to List((id1,a,3) , (id2,a,1))
This is what I have come up with but I'm unsure how to group similar items within a Tuple :
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List here for local testing, but the same solution should be compatible with an RDD.
Update : based on following code provided by maasg :
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
which corresponds to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable:
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]; a final map { case ((key1, key2), sum) => (key1, key2, sum) } flattens it to RDD[(String, String, Int)]. Using reduceByKey means the summation will parallelize, i.e. for very large groups it is distributed and partial sums are computed on the map side. Think about the case where there are only 10 groups but billions of records: using groupByKey followed by .sum won't scale, as it can only distribute the summation across 10 cores.
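The point of reduceByKey is that it never materialises a whole group; it only keeps one running sum per key. A local sketch of the same idea with a fold over a plain Seq (SumByKeySketch and sumByKey are made-up names, and this single-threaded fold only mirrors the per-key accumulation, not Spark's distribution):

```scala
object SumByKeySketch {
  // Keeps one running total per (id, name) key instead of building the groups first.
  def sumByKey(rows: Seq[(String, String, Int)]): Map[(String, String), Int] =
    rows.foldLeft(Map.empty[(String, String), Int]) { case (acc, (id, name, n)) =>
      acc.updated((id, name), acc.getOrElse((id, name), 0) + n)
    }
}
```

Applied to data3 from the question, this gives Map((id1,a) -> 3, (id2,a) -> 1).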
A few more notes about the other answers:
Using head here is unnecessary, because the key produced by groupBy already holds both fields: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)).values can be replaced by .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really horrible when, as shown above, .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}