word count in spark-scala for rdd(String,String,Long) - scala

I am new to Spark-Scala and trying to solve a simple word count (having multiple attributes as keys). Can I get some input?
I have an RDD[(String, String, Long)] like
(a,b,1)
(a,c,1)
(a,c,1)
(b,b,1)
(b,b,1)
The desired result is an RDD like
(a,b,1)
(a,c,2)
(b,b,2)

Try:
rdd.map {
  case (x, y, c) => ((x, y), c)
}.reduceByKey(_ + _)
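A minimal end-to-end sketch, assuming an available SparkContext sc and building the sample data with parallelize (the variable names are illustrative):

val rdd = sc.parallelize(Seq(
  ("a", "b", 1L), ("a", "c", 1L), ("a", "c", 1L), ("b", "b", 1L), ("b", "b", 1L)
))

val counted = rdd
  .map { case (x, y, c) => ((x, y), c) }  // combine the two key attributes into one composite key
  .reduceByKey(_ + _)                     // sum the Long counts per composite key
  .map { case ((x, y), c) => (x, y, c) }  // optional: flatten back to the original 3-tuple shape

counted.collect().foreach(println)
// (a,b,1), (a,c,2), (b,b,2)  (order may vary)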

Related

How to get minimum value for each distinct key using ReduceByKey() in Scala

I have a flatMap that returns the sequence Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)). Now I need to use reduceByKey() on the sequence that I got from the flatMap to find the minimum value for each key.
I tried using .reduceByKey(a,min(b)) and .reduceByKey((a, b) => if (a._1 < b._1) a else b), but neither of them works.
This is my code:
for (i <- 1 to 5) {
  var graph = graph.flatMap { in => in match { case (x, y, zs) => (x, y) :: zs.map(z => (z, y)) } }
    .reduceByKey((a, b) => if (a._1 < b._1) a else b)
}
For each distinct key the flatMap generates, I need to get the minimum value for that key. E.g., if the flatMap generates Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)), then reduceByKey() should generate (20,1),(22,1),(23,6),(24,6).
Here is the signature of reduceByKey:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Basically, given an RDD of key/value pairs, you need to provide a function that reduces two values (and not the entire pairs) into one. Therefore, you can use it as follows:
val rdd = sc.parallelize(Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)))
val result = rdd.reduceByKey((a, b) => if (a < b) a else b)
result.collect
// Array[(Int, Int)] = Array((24,6), (20,1), (22,1), (23,6))
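Equivalently (just an alternative spelling of the same reduction, assuming Int values as above), the comparison can be written with math.min:

val result = rdd.reduceByKey(math.min(_, _))  // same as (a, b) => if (a < b) a else b
result.collect
// Array[(Int, Int)] = Array((24,6), (20,1), (22,1), (23,6))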

How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a CSV file like this:
x, y, z
x, y
x
x, y, c, f
x, z
I want to make a map of items to their counts. This is the code I wrote:
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()
  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}
When I run it, it is taking a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2-node cluster with 2 cores each and 8 partitions. The number of lines in the CSV file is 170,000.
If you just want a unique item count, then I suppose you can take the following approach.
val data: RDD[Array[Item]] = ???

val itemFrequency = data
  .flatMap(arr =>
    arr.map(item => (item, 1))
  )
  .reduceByKey(_ + _)
Do not provide any partitioner while reducing; otherwise it will cause re-shuffling. Just keep the partitioning it already has.
Also... do not collect the distributed data into a local in-memory object like a Map.
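If a small driver-side map is still needed downstream, a hedged sketch (the minSupport threshold and names below are illustrative, not from the question) is to shrink the result before anything is collected:

// Keep working on the distributed itemFrequency where possible; if a driver-side
// map is unavoidable, filter first so only a small result crosses the network.
val minSupport = 100  // hypothetical cut-off
val localCounts = itemFrequency
  .filter { case (_, count) => count >= minSupport }
  .collectAsMap()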

how to merge 2 different rdd in spark using scala

I'm trying to merge 2 RDDs into one. My rdd1 consists of 2 records with 2 elements each, both strings, e.g.:
key_A:value_A and Key_B:value_B
rdd2 also consists of 1 record with 2 elements, both of which are strings:
key_C:value_c
My final RDD would look like this:
key_A :value_A , Key_B :value_B , key_C :value_c
I tried the union method of RDD, but it's not working. Please help.
When using union on 2 RDDs, must the rows of the 2 different RDDs contain the same number of elements, or can their sizes differ?
Try with join:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
See the associated section of the docs
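For illustration (the question's RDDs share no common key, so this only shows the shape join produces; sc and the literal values are assumptions):

val left = sc.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B")))
val right = sc.parallelize(Seq(("key_A", "value_X")))

// join keeps only keys present in both RDDs, pairing up their values
left.join(right).collect()
// Array((key_A,(value_A,value_X)))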
union is working.
Sample code is:
val rdd = sparkContext.parallelize(1 to 10, 3)
val pairRDD = rdd.map { x => (x, x) }

val rdd1 = sparkContext.parallelize(11 to 20, 3)
val pairRDD1 = rdd1.map { x => (x, x) }

pairRDD.union(pairRDD1).foreach(tuple => {
  println(tuple._1)
  println(tuple._2)
})
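Applied to data shaped like the question's (the values come from the question; the variable names and sc are assumptions):

val rddA = sc.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B")))
val rddB = sc.parallelize(Seq(("key_C", "value_c")))

// union simply concatenates the two RDDs; the keys do not have to match
rddA.union(rddB).collect()
// Array((key_A,value_A), (Key_B,value_B), (key_C,value_c))

Note that union only requires both RDDs to have the same element type; they can contain different numbers of records.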

Aggregation of multiple values using scala/spark

I am new to Spark and Scala. I want to sum up all the values present in the RDD. Below is an example.
The RDD holds key/value pairs, and suppose that after doing some joins and transformations the RDD has 3 records as below, where A is the key:
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
Now I want to sum up the values of each record element-wise with the corresponding values in the other records, so the output should look like
(A, List(3,3,3,3,3,3,3))
Can anyone please help me out with this? Is there a way to achieve this using Scala?
Big thanks in advance.
A naive approach is to reduceByKey:
rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
but it is rather inefficient because it creates a new List on each merge.
You can improve on that by using, for example, aggregateByKey with a mutable buffer:
rdd.aggregateByKey(Array.fill(7)(0))( // Mutable buffer as the zero value
  // For seqOp we'll mutate the accumulator in place
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // For performance you could modify acc1 in place as above
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
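A hedged end-to-end check with the question's sample data (sc.parallelize and collect are only for illustration), using the simpler reduceByKey variant from above:

val rdd = sc.parallelize(Seq(
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1))
))

rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
).collect()
// Array((A,List(3, 3, 3, 3, 3, 3, 3)))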
It should also be possible to use DataFrames, but by default recent versions schedule aggregations separately, so without adjusting the configuration it is probably not worth the effort.

SortByValue for a RDD of tuples

Recently I was asked (in a class assignment) to find the top 10 most frequent words in an RDD. I submitted my assignment with a working solution, which looks like
wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)
So basically, I swap the tuple, sort by key, and then swap again. Then finally take 10. I don't find the repeated swapping very elegant.
So I wonder if there is a more elegant way of doing this.
I searched and found some people using Scala implicits to convert the RDD into a Scala Sequence and then doing the sortByValue, but I don't want to convert RDD to a Scala Seq, because that will kill the distributed nature of the RDD.
So is there a better way?
How about this:
wordsRdd.
  map(x => (x, 1)).
  reduceByKey(_ + _).
  takeOrdered(10)(Ordering.by(-1 * _._2))
or a little bit more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}

wordsRdd.
  map(x => (x, 1)).
  reduceByKey(_ + _).
  takeOrdered(10)(WordCountPairsOrdering)
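For context, a hedged usage sketch (the sample words and sc.parallelize are illustrative, not from the question):

val wordsRdd = sc.parallelize(Seq("spark", "scala", "spark", "rdd", "spark", "scala"))

val top = wordsRdd.
  map(x => (x, 1)).
  reduceByKey(_ + _).
  takeOrdered(10)(Ordering.by(-1 * _._2))  // order by descending count

// top: Array((spark,3), (scala,2), (rdd,1))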