Spark: How to avoid memory issues on large aggregation with distinct - scala

I start with an RDD representing the following data: user_id, product_id, bought or not, i.e. RDD[(String, String, Int)].
To compute stats such as how many products each user has bought, I wrote the following method:
def userProductAggregation(rdd: org.apache.spark.rdd.RDD[(String, String, Int)]): RDD[(String, Long)] = {
  val productPerUserRDD = rdd
    .filter(_._3 == 1)
    .map { case (u, p, _) => (p, u) }
    .distinct(numPartitions = 5000)
    .map { case (p, _) => (p, 1L) }
    .reduceByKey(_ + _, numPartitions = 5000)
  productPerUserRDD
}
The problem is that I get a Java heap space error when I run this. My total input size is close to 500GB. In standalone mode I have set up Spark with --driver-cores 8G --executor-memory 16G --total-executor-cores 80. I would think this should be plenty for the job. Is there a better way to write this method? I thought my approach was very efficient, but I am starting to question that. I have also tried increasing the number of partitions up to 8000, but the issue stayed the same.
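A minimal sketch of one lower-memory alternative, assuming an approximate count of distinct buyers per product is acceptable (this is not a drop-in replacement for the exact method above): countApproxDistinctByKey combines HyperLogLog sketches per key instead of shuffling every distinct (product, user) pair, which keeps the shuffle and heap footprint much smaller. The 0.01 relative accuracy and the 5000 partitions are placeholders.

import org.apache.spark.rdd.RDD

def approxBuyerCountPerProduct(rdd: RDD[(String, String, Int)]): RDD[(String, Long)] =
  rdd
    .filter(_._3 == 1)                                  // keep only actual purchases
    .map { case (user, product, _) => (product, user) }
    .countApproxDistinctByKey(0.01, 5000)               // approximate distinct users per product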

Related

Spark: Writing RDD Results to File System is Slow

I'm developing a Spark application with Scala. My application consists of only one operation that requires a shuffle (namely cogroup). It runs flawlessly and in a reasonable time. The issue I'm facing is when I want to write the results back to the file system; for some reason it takes longer than running the actual program. At first, I tried writing the results without re-partitioning or coalescing, and I realized that the number of generated files was huge, so I thought that was the issue. I tried re-partitioning (and coalescing) before writing, but then the application took a long time performing those tasks. I know that re-partitioning (and coalescing) is costly, but is what I'm doing the right way? If it's not, could you please give me hints on the right approach?
Notes:
My file system is Amazon S3.
My input data size is around 130GB.
My cluster contains a driver node and five slave nodes, each with 16 cores and 64 GB of RAM.
I'm assigning 15 executors to my job, each with 5 cores and 19 GB of RAM.
P.S. I tried using DataFrames; same issue.
Here is a sample of my code just in case:
val sc = spark.sparkContext

// loading the samples
val samplesRDD = sc
  .textFile(s3InputPath)
  .filter(_.split(",").length > 7)
  .map(parseLine)
  .filter(_._1.nonEmpty) // skips any un-parsable lines

// pick random samples
val samples1Ids = samplesRDD
  .map(_._2._1) // map to id
  .distinct
  .takeSample(withReplacement = false, 100, 0)

// broadcast it to the cluster's nodes
val samples1IdsBC = sc broadcast samples1Ids

val samples1RDD = samplesRDD
  .filter(samples1IdsBC.value contains _._2._1)
val samples2RDD = samplesRDD
  .filter(sample => !samples1IdsBC.value.contains(sample._2._1))

// compute
samples1RDD
  .cogroup(samples2RDD)
  .flatMapValues { case (left, right) =>
    left.map(sample1 => (sample1._1, right.filter(sample2 => isInRange(sample1._2, sample2._2)).map(_._1)))
  }
  .map { case (timestamp, (sample1Id, sample2Ids)) =>
    s"$timestamp,$sample1Id,${sample2Ids.mkString(";")}"
  }
  .repartition(10)
  .saveAsTextFile(s3OutputPath)
UPDATE
Here is the same code using Dataframes:
// imports needed for this snippet
import scala.collection.mutable
import org.apache.spark.sql
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

// loading the samples
val samplesDF = spark
  .read
  .csv(inputPath)
  .drop("_c1", "_c5", "_c6", "_c7", "_c8")
  .toDF("id", "timestamp", "x", "y")
  .withColumn("x", ($"x" / 100.0f).cast(sql.types.FloatType))
  .withColumn("y", ($"y" / 100.0f).cast(sql.types.FloatType))

// pick random ids as samples 1
val samples1Ids = samplesDF
  .select($"id") // map to the id
  .distinct
  .rdd
  .takeSample(withReplacement = false, 1000)
  .map(r => r.getAs[String]("id"))

// broadcast it to the executors
val samples1IdsBC = sc broadcast samples1Ids

// get samples 1 and 2
val samples1DF = samplesDF
  .where($"id" isin (samples1IdsBC.value: _*))
val samples2DF = samplesDF
  .where(!($"id" isin (samples1IdsBC.value: _*)))

samples2DF
  .withColumn("combined", struct("id", "x", "y"))
  .groupBy("timestamp")
  .agg(collect_list("combined").as("combined_list"))
  .join(samples1DF, Seq("timestamp"), "rightouter")
  .map {
    case Row(timestamp: String, samples: mutable.WrappedArray[GenericRowWithSchema], sample1Id: String, sample1X: Float, sample1Y: Float) =>
      val sample2Info = samples.filter {
        case Row(_, sample2X: Float, sample2Y: Float) =>
          Misc.isInRange((sample2X, sample2Y), (sample1X, sample1Y), 20)
        case _ => false
      }.map {
        case Row(sample2Id: String, sample2X: Float, sample2Y: Float) =>
          s"$sample2Id:$sample2X:$sample2Y"
        case _ => ""
      }.mkString(";")
      (timestamp, sample1Id, sample1X, sample1Y, sample2Info)
    case Row(timestamp: String, _, sample1Id: String, sample1X: Float, sample1Y: Float) => // no overlapping samples
      (timestamp, sample1Id, sample1X, sample1Y, "")
    case _ =>
      ("error", "", 0.0f, 0.0f, "")
  }
  .where($"_1" notEqual "error")
  // .show(1000, truncate = false)
  .write
  .csv(outputPath)
The issue here is that Spark normally commits tasks and jobs by renaming files, and on S3 renames are really, really slow. The more data you write, the longer it takes at the end of the job. That is what you are seeing.
Fix: switch to the S3A committers, which don't do any renames.
Some tuning options can massively increase the number of threads used for IO and commits, and the connection pool size:
fs.s3a.threads.max - raise from the default of 10 to something bigger
fs.s3a.committer.threads - number of files committed by a POST in parallel; default is 8
fs.s3a.connection.maximum - try (fs.s3a.committer.threads + fs.s3a.threads.max + 10)
These defaults are all fairly small because many jobs work with multiple buckets, and if each one had big numbers it would be really expensive to create an S3A client... but if you have many thousands of files, it is probably worthwhile.
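A minimal sketch of how these might be raised from a Spark application, assuming the spark.hadoop.* prefix (which Spark copies into the Hadoop configuration) fits your setup; the concrete numbers are placeholders chosen only to illustrate how the three settings relate, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-commit-tuning")
  // more IO threads than the default of 10
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  // more files committed by a POST in parallel (default 8)
  .config("spark.hadoop.fs.s3a.committer.threads", "32")
  // connection pool sized to cover both thread pools plus some headroom
  .config("spark.hadoop.fs.s3a.connection.maximum", (32 + 64 + 10).toString)
  .getOrCreate()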

Use combineByKey to get output as (key, iterable[values])

I am trying to transform an RDD[(key, value)] into an RDD[(key, Iterable[value])], the same output that the groupByKey method returns.
But as groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used:
val data = List("abc,2017-10-04,15.2",
  "abc,2017-10-03,19.67",
  "abc,2017-10-02,19.8",
  "xyz,2017-10-09,46.9",
  "xyz,2017-10-08,48.4",
  "xyz,2017-10-07,87.5",
  "xyz,2017-10-04,83.03",
  "xyz,2017-10-03,83.41",
  "pqr,2017-09-30,18.18",
  "pqr,2017-09-27,18.2",
  "pqr,2017-09-26,19.2",
  "pqr,2017-09-25,19.47",
  "abc,2017-07-19,96.60",
  "abc,2017-07-18,91.68",
  "abc,2017-07-17,91.55")

val rdd = sc.parallelize(data)

val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})
// repartition and sort based on key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))

val temp = op.map(f => (f._1._1, (f._1._2, f._2)))

val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
  (t1._1 + t2._1, t1._2.++(t2._2))

val mergeValue = (x: (String, List[String]), y: (String, String)) => {
  val a = x._2.+:(y._2)
  (x._1, a)
}

// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
  (t1: String, t2: String) => (t1, List(t2)),
  mergeValue,
  mergeCombiners)
temp.combineByKey gives a compile-time error that I am not able to resolve.
If you want an output similar to what groupByKey gives you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient compared to using groupByKey followed by an aggregation (where they give you the same result that the groupBy approach would have given).
As the wanted result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work. There is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation; it lies in the distributed architecture.
For more information regarding groupByKey and these types of optimizations, I would recommend reading more here.
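As a minimal sketch of the answer's advice, reusing the temp RDD of (key, (date, price)) pairs built in the question: groupByKey already produces the RDD[(key, Iterable[value])] shape that the combineByKey attempt was trying to reproduce.

val grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, String)])] = temp.groupByKey()

grouped.collect().foreach { case (key, values) =>
  println(s"$key -> ${values.mkString(", ")}")
}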

akka stream parallelism and performance

I am learning Akka Streams and I am not sure I fully understand the performance difference between these two snippets when running on my laptop with 2 cores and 8 GB of RAM.
val f = Source(1 to numberOfFiles)
  .mapAsyncUnordered(numberOfFiles) { _ =>
    val fileName = UUID.randomUUID().toString
    println(fileName)
    Source(1 to numberOfCustomers).mapAsyncUnordered(numberOfCustomers) { _ =>
      val rMsisdn = TestUtils.randomString(8)
      Future(List(1 to Random.nextInt(20)).map { i =>
        val rCdr = RandomCdr(rMsisdn)
        ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
      }.fold(ByteString())(_ concat _))
    }.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
  }
  .runForeach(io => println(io.status))
and this one:
val f = Source(1 to numberOfFiles)
  .mapAsyncUnordered(numberOfFiles) { _ =>
    val fileName = UUID.randomUUID().toString
    println(fileName)
    Source(1 to numberOfCustomers).map { _ =>
      val rMsisdn = TestUtils.randomString(8)
      List(1 to Random.nextInt(20)).map { i =>
        val rCdr = RandomCdr(rMsisdn)
        ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
      }.fold(ByteString())(_ concat _)
    }.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
  }
  .runForeach(io => println(io.status))
The second one performs better, and the difference grows as the load increases (more files to write and more customers to generate).
My assumption is that the random generation is not very expensive, so parallelizing it with mapAsyncUnordered costs more than just running it sequentially. Am I right?
What I don't understand is why the difference increases with the number of customers. The more customers I have, the bigger the gap between sequential and parallel generation.
Does it also come from the fact that I have a stream inside a stream? Is it inefficient to nest two levels of parallelism?
Thanks for your explanation, and if you have any suggestions for tuning this code, don't hesitate!
Edit
A new attempt with flatMapConcat as suggested, but I still have an issue with the filename (it doesn't compile). How do I use the first element of the tuple as the filename of the sink?
val f = Source(1 to numberOfFiles)
  .map { i =>
    val fileName = UUID.randomUUID().toString
    println(fileName)
    fileName
  }
  .flatMapConcat { f =>
    Source(1 to numberOfCustomers).map { p =>
      val rMsisdn = TestUtils.randomString(8)
      (f, List(1 to Random.nextInt(20)).map { i =>
        val rCdr = RandomCdr(rMsisdn)
        ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
      }.fold(ByteString())(_ concat _))
    }
  }
  .runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
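A minimal sketch, not a definitive answer: one way to keep the generated file name in scope for the sink is to run each per-file inner stream to its own FileIO sink inside the outer stage (as in the original snippets), while bounding the outer parallelism instead of tying it to numberOfFiles. RandomCdr, TestUtils, numberOfFiles, numberOfCustomers and an implicit ActorSystem/Materializer are assumed to exist as in the question.

import java.nio.file.Paths
import java.util.UUID
import scala.util.Random
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

val f = Source(1 to numberOfFiles)
  .mapAsyncUnordered(parallelism = 4) { _ =>      // bounded parallelism instead of numberOfFiles
    val fileName = UUID.randomUUID().toString
    Source(1 to numberOfCustomers)
      .map { _ =>                                  // cheap CPU-bound work: a plain map is enough
        val rMsisdn = TestUtils.randomString(8)
        (1 to Random.nextInt(20)).map { _ =>
          val rCdr = RandomCdr(rMsisdn)
          ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
        }.fold(ByteString())(_ concat _)
      }
      .runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName"))) // fileName stays in scope
  }
  .runForeach(io => println(io.status))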

Optimizing cartesian product using keys in spark

To avoid computing all possible combinations, I'm trying to group values according to a certain key, and then compute the Cartesian product of the values for each key, i.e.:
Input: [(k1, v1), (k1, v2), (k2, v3)]
Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
Here is the code I have tried executing:
val input = sc.textFile("data.csv")
val rdd = input.map(s => s.split(","))
  .map(s => (s(1).toString, s(2).toString))
val group_result: RDD[(String, Iterable[String])] = rdd.groupByKey()
group_result.flatMap { t =>
  val stream1 = t._2.toStream
  val stream2 = t._2.toStream
  stream1.flatMap { src =>
    stream2.par.map { trg =>
      src + "," + trg
    }
  }
}
This works fine for very small files, but when the list (Iterable) is of length ~1000 the computation freezes completely.
As @zero323 said, the best way to solve this is with PairRDDFunctions' join method; however, to achieve this you need a pair RDD, which can be obtained with RDD's keyBy method.
You could do something like:
val rdd = sc.parallelize(Array(("k1", "v1"), ("k1", "v2"), ("k2", "v3"))).keyBy(_._1)
val result = rdd.join(rdd).map {
  case (key: String, (x: Tuple2[String, String], y: Tuple2[String, String])) => (x._2, y._2)
}
result.take(20)
// res9: Array[(String, String)] = Array((v1,v1), (v1,v2), (v2,v1), (v2,v2), (v3,v3))
Here I share the notebook with the code.
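A small variation on the same idea (a sketch, not from the original answer): since the sample RDD already consists of (key, value) pairs, it can be self-joined directly and the per-key value pairs extracted with values, without keyBy.

val pairs = sc.parallelize(Seq(("k1", "v1"), ("k1", "v2"), ("k2", "v3")))
// join pairs up the values of matching keys; values then drops the key
val perKeyCartesian = pairs.join(pairs).values
perKeyCartesian.collect()
// expected: Array((v1,v1), (v1,v2), (v2,v1), (v2,v2), (v3,v3)) (order may vary)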

take top N after groupBy and treat them as RDD

I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2,
"bar"->6, "bar"->5, "bar"->4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) =>
    key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
What I want is for topNPerGroup.collect.foreach(println) to generate (expected result!):
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly because my Iterable in (K, Iterable<V>) was very large, > 1 million items, so the sorting and taking of the top N became very expensive and created potential memory issues.
After some digging (see references below), here is a full example using combineByKey to accomplish the same task in a way that will perform and scale.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopNForKey {

  var SampleDataset = List(
    (1, ("apple.com", 3L)),
    (1, ("google.com", 4L)),
    (1, ("stackoverflow.com", 10L)),
    (1, ("reddit.com", 15L)),
    (2, ("slashdot.org", 11L)),
    (2, ("samsung.com", 1L)),
    (2, ("apple.com", 9L)),
    (3, ("microsoft.com", 5L)),
    (3, ("yahoo.com", 3L)),
    (3, ("google.com", 4L)))

  // sort and trim a traversable of (String, Long) tuples by the _2 value of the tuple
  def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
    var ss = List[(String, Long)]()
    var min = Long.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e._2 > min) {
        ss = (e :: ss).sortBy((f) => f._2)
        min = ss.head._2
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head._2
        len -= 1
      }
    }
    ss
  }

  def main(args: Array[String]): Unit = {
    val topN = 2
    val sc = new SparkContext("local", "TopN For Key")
    val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))

    // use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
    val topNForKey = rdd.combineByKey(
      // seed a list for each key to hold your top N's with your first record
      (v) => List[(String, Long)](v),
      // add the incoming value to the accumulating top N list for the key
      (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
      // merge top N lists returned from each partition into a new combined top N list
      (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)

    // print results, sorting for pretty output
    topNForKey.sortByKey(true).foreach((t) => {
      println(s"key: ${t._1}")
      t._2.foreach((v) => {
        println(s"----- $v")
      })
    })
  }
}
And what I get in the returned RDD...
(1, List(("stackoverflow.com", 10L), ("reddit.com", 15L)))
(2, List(("apple.com", 9L), ("slashdot.org", 11L)))
(3, List(("google.com", 4L), ("microsoft.com", 5L)))
References
https://www.mail-archive.com/user#spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 solves this.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
It uses BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
  self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
    seqOp = (queue, item) => {
      queue += item
    },
    combOp = (queue1, queue2) => {
      queue1 ++= queue2
    }
  ).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
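A small follow-up sketch using the same imports as above (the printed output is my expectation, not taken from the original post; key order may differ): the n-smallest variant can be flattened exactly the same way.

val bottomTwo: RDD[(String, Int)] =
  data.topByKey(2)(scala.math.Ordering.by[Int, Int](-_)).flatMapValues(x => x)

bottomTwo.collect.foreach(println)
// (foo,1)
// (foo,2)
// (bar,4)
// (bar,5)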