How to reduce shuffling and time taken by Spark while making a map of items? - scala

I am using Spark to read a CSV file like this:
x, y, z
x, y
x
x, y, c, f
x, z
I want to build a map from each item to its count. This is the code I wrote:
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()
  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}
When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2-node cluster with 2 cores per node and 8 partitions. The CSV file has 170,000 lines.

If you just want a count per unique item, then I suppose you can take the following approach.
val data: RDD[Array[Item]] = ???
val itemFrequency = data
  .flatMap(arr =>
    arr.map(item => (item, 1))
  )
  .reduceByKey(_ + _)
Do not pass a partitioner to reduceByKey unless you need one: if the RDD already has a partitioner, supplying a different one forces a re-shuffle. Just keep the partitioning it already has.
Also, avoid collecting the distributed data into a local in-memory object like a Map.

Related

Aggregation of multiple values using scala/spark

I am new to Spark and Scala. I want to sum up all the values present in the RDD. Below is an example.
The RDD holds key-value pairs. Suppose that after some joins and transformations it contains the 3 records below, where A is the key:
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
Now I want to sum the values of each record element-wise with the corresponding values in the other records, so the output should look like
(A, List(3,3,3,3,3,3,3))
Can anyone please help me out with this? Is there a way to achieve this using Scala?
Big thanks in advance.
A naive approach is to reduceByKey:
rdd.reduceByKey(
(xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
but it is rather inefficient because it creates a new List on each merge.
You can improve on that by using, for example, aggregateByKey with a mutable buffer:
rdd.aggregateByKey(Array.fill(7)(0))( // Mutable buffer
  // For seqOp we'll mutate the accumulator
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // For performance you could modify acc1 as above
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
It should also be possible to use DataFrames, but by default recent versions schedule aggregations separately, so without adjusting the configuration it is probably not worth the effort.
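The comment in the code above notes that the combine step could mutate acc1 as well. Here is a minimal sketch of what that might look like, written as two standalone functions (the names seqOp and combOp are made up for illustration) and checked on plain Scala collections instead of a cluster. It is written as a Scala 3 / REPL-style script and assumes that no list is longer than the buffer:

```scala
// Hypothetical helpers; the names seqOp and combOp are chosen for illustration.

// Adds each element of xs into the accumulator in place and returns it.
// Elements beyond the accumulator's length are ignored (assumption: there are none).
def seqOp(acc: Array[Int], xs: Seq[Int]): Array[Int] = {
  var i = 0
  val it = xs.iterator
  while (it.hasNext && i < acc.length) {
    acc(i) += it.next()
    i += 1
  }
  acc
}

// Merges acc2 into acc1 in place, so no new array is allocated per merge.
def combOp(acc1: Array[Int], acc2: Array[Int]): Array[Int] = {
  var i = 0
  while (i < acc1.length && i < acc2.length) {
    acc1(i) += acc2(i)
    i += 1
  }
  acc1
}

// Local check with plain collections, simulating two partitions.
val records = Seq(List(1, 1, 1), List(1, 1, 1), List(1, 1, 1))
val (left, right) = records.splitAt(2)
val partial1 = left.foldLeft(Array.fill(3)(0))(seqOp)
val partial2 = right.foldLeft(Array.fill(3)(0))(seqOp)
val total = combOp(partial1, partial2).toList
```

With Spark these would be passed as rdd.aggregateByKey(Array.fill(7)(0))(seqOp, combOp).mapValues(_.toList). The aggregateByKey documentation allows both functions to modify and return their first argument, which is what makes the in-place update acceptable there.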

RDD split and do aggregation on new RDDs

I have an RDD of (String, String, Int).
I want to reduce it based on the first two strings.
Then, based on the first String, I want to group the (String, Int) pairs and sort them.
After sorting, I need to split them into small groups, each containing n elements.
I have written the code below. The problem is that the number of elements in step 2 is very large for a single key,
and the reduceByKey(x ++ y) takes a lot of time.
// Input
import scala.collection.mutable.ArrayBuffer
val data = Array(
  ("c1","a1",1), ("c1","b1",1), ("c2","a1",1),("c1","a2",1), ("c1","b2",1),
  ("c2","a2",1), ("c1","a1",1), ("c1","b1",1), ("c2","a1",1))
val rdd = sc.parallelize(data)
val r1 = rdd.map(x => ((x._1, x._2), (x._3)))
val r2 = r1.reduceByKey((x, y) => x + y ).map(x => ((x._1._1), (x._1._2, x._2)))
// This is taking long time.
val r3 = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
// from the list I will be doing grouping.
val r4 = r3.map(x => (x._1 , x._2.toList.sorted.grouped(2).toList))
The problem is that "c1" has a lot of unique entries (b1, b2, ... millions of them), and reduceByKey is very slow because all the values go to a single node.
Is there a way to achieve this more efficiently?
// output
Array((c1,List(List((a1,2), (a2,1)), List((b1,2), (b2,1)))), (c2,List(List((a1,2), (a2,1)))))
There are at least a few problems with the way you group your data. The first problem is introduced by
mapValues(x => ArrayBuffer(x))
It creates a large number of mutable objects which provide no additional value, since you cannot leverage their mutability in the subsequent reduceByKey
reduceByKey((x, y) => x ++ y)
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map-side aggregation, the situation is even worse and pretty much creates GC hell.
Is there a way to achieve this more efficiently?
Unless you have some deeper knowledge about the data distribution which can be used to define a smarter partitioner, the simplest improvement is to replace mapValues + reduceByKey with a simple groupByKey:
val r3 = r2.groupByKey
It should also be possible to use a custom partitioner for both reduceByKey calls and mapPartitions with preservesPartitioning instead of map.
class FirstElementPartitioner(partitions: Int)
  extends org.apache.spark.Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    // % keeps the sign of the hash, so shift negative results into range
    val raw = key.asInstanceOf[(Any, Any)]._1.## % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}
val r2 = r1
  .reduceByKey(new FirstElementPartitioner(8), (x, y) => x + y)
  .mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), true)
// No shuffle required here.
val r3 = r2.groupByKey
It requires only a single shuffle, and groupByKey is simply a local operation:
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
// | MapPartitionsRDD[40] at mapPartitions at <console>:35 []
// | ShuffledRDD[39] at reduceByKey at <console>:34 []
// +-(8) MapPartitionsRDD[1] at map at <console>:28 []
// | ParallelCollectionRDD[0] at parallelize at <console>:26 []
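To finish the pipeline after groupByKey, each key's values still need to be sorted and split into groups of n (the r4 step from the question). A minimal sketch of that per-key step as a plain function, checked locally on the values "c1" has in the sample data. The helper name sortAndChunk is made up for illustration, and the snippet is written as a Scala 3 / REPL-style script:

```scala
// Hypothetical helper: sorts one key's values and splits them into chunks of n.
def sortAndChunk(values: Iterable[(String, Int)], n: Int): List[List[(String, Int)]] =
  values.toList.sorted.grouped(n).toList

// Local check with the (String, Int) pairs that "c1" has after the first reduceByKey.
val c1Values = Seq(("b1", 2), ("a2", 1), ("a1", 2), ("b2", 1))
val chunks = sortAndChunk(c1Values, 2)
```

On the RDD this would be applied as r3.mapValues(vs => sortAndChunk(vs, 2)). Note that it still materializes all values of one key in memory on a single executor, so it does not by itself help with a heavily skewed key.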

word count(frequency) spark rdd scala

If I have an RDD across a cluster and I want to do a word count,
not only counting the number of appearances,
I also want to get the frequency, which is defined as count / total count.
What is the best and most efficient way to do so in Scala?
How can I do the reduction and calculate the total at the same time within one workflow?
BTW, I know a plain word count can be done this way:
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
But what is the difference if I use aggregate, in terms of the Spark job workflow?
val result = pairs
  .aggregate(Map[String, Int]())((acc, pair) =>
    if(acc.contains(pair._1))
      acc ++ Map[String, Int]((pair._1, acc(pair._1)+1))
    else
      acc ++ Map[String, Int]((pair._1, pair._2))
    ,
    (a, b) =>
      (a.toSeq ++ b.toSeq)
        .groupBy(_._1)
        .mapValues(_.map(_._2).reduce(_ + _))
  )
You can use this:
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
Spark also has the concept of an Accumulator, a variable that tasks can only add to and only the driver can read. You could use one to avoid the sum() action, but your code would need a lot of changes.
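The division step is easy to get wrong if the total ends up as an integer type (for example when it is computed with reduce instead of sum()), because integer division would truncate every frequency to 0. A small sketch of the same calculation on a plain Scala Map, just to show the arithmetic. The function name frequencies is made up for illustration, and the snippet is written as a Scala 3 / REPL-style script:

```scala
// Hypothetical helper: turns word counts into relative frequencies.
def frequencies(counts: Map[String, Long]): Map[String, Double] = {
  val total = counts.values.sum
  if (total == 0L) Map.empty
  // Convert to Double before dividing to avoid integer division.
  else counts.map { case (word, c) => word -> c.toDouble / total }
}

val freq = frequencies(Map("a" -> 3L, "b" -> 1L))
```

In the RDD version above this conversion happens implicitly, because sum() on an RDD returns a Double.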

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of pairs is only a constant factor times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
keyPairs map {case (k1,k2) =>
val v1 : String = data lookup k1 head;
val v2 : String = data lookup k2 head;
((k1, k2), (v1,v2))
}
}
// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
// Construct all possible pairs of values
val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
// Select only the values whose keys are in keyPairs
keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}
// Example function that finds pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
val keys = data map (_._1)
keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}
// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)
// Print:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition, I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In the lookup 2, I don't think it's necessary to perform full cartesian...
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
.join(data)
.map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
.join(data)
.map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
.join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...

How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format:
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is an integer and the product_titleX is a String, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
So far I am using Spark with Scala, following the tutorials I found on the official page and the Berkeley AMP Camp 3 and 4.
So I'm reading the file:
val file = sc.textFile("offers.csv")
Then I'm mapping it into tuples of type RDD[Array[String]]:
val tuples = file.map(line => line.split(",")).cache
and after that I'm transforming the tuples into pairs of type RDD[(Int, String)]:
val pairs = tuples.map(line => (line(0).toInt, line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return list(cnt.items())
tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda kv: (kv[0], 1. + log10(float(num_documents) / kv[1]))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
            (k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.
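Since the question is about Scala, here is a rough sketch of the same calculation on a small in-memory corpus, using the idf formula from the answer above (1 + log10(N / df)). It only illustrates the arithmetic and is not a distributed solution. The function name tfidf and the sample corpus are made up, tokens are used directly instead of token ids, and the snippet is written as a Scala 3 / REPL-style script. MLlib also ships HashingTF and IDF in org.apache.spark.mllib.feature, which may be a better fit for real data.

```scala
// Hypothetical in-memory version: document id -> tokens.
def tfidf(corpus: Map[Int, Seq[String]]): Map[Int, Map[String, Double]] = {
  val numDocs = corpus.size.toDouble
  // Document frequency: number of documents that contain each token.
  val df: Map[String, Int] =
    corpus.values.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
  // Same formula as the answer above: 1 + log10(N / df).
  val idf: Map[String, Double] =
    df.map { case (t, d) => t -> (1.0 + math.log10(numDocs / d)) }
  corpus.map { case (docId, tokens) =>
    // Term frequency: raw count of each token within the document.
    val tf = tokens.groupBy(identity).map { case (t, occ) => t -> occ.size }
    docId -> tf.map { case (t, n) => t -> n * idf(t) }
  }
}

// Made-up sample data for a local check.
val sample = Map(
  1 -> Seq("apple", "iphone", "apple"),
  2 -> Seq("apple", "case")
)
val weights = tfidf(sample)
```

Here "apple" occurs in both documents, so its idf is 1 + log10(2 / 2) = 1 and its weight in document 1 is just its count.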