I'm developing a Spark application in Scala. The application contains only one operation that requires shuffling (namely cogroup). It runs flawlessly and in a reasonable time. The issue I'm facing is writing the results back to the file system: for some reason, it takes longer than running the actual program. At first, I tried writing the results without repartitioning or coalescing, and I realized that the number of generated files was huge, so I thought that was the issue. I then tried repartitioning (and coalescing) before writing, but the application took a long time performing these tasks. I know that repartitioning (and coalescing) is costly, but is what I'm doing the right approach? If not, could you please give me hints on what the right approach is?
Notes:
My file system is Amazon S3.
My input data size is around 130GB.
My cluster contains a driver node and five slave nodes, each with 16 cores and 64 GB of RAM.
I'm assigning 15 executors to my job, each with 5 cores and 19 GB of RAM.
P.S. I tried using Dataframes, same issue.
Here is a sample of my code just in case:
val sc = spark.sparkContext
// loading the samples
val samplesRDD = sc
.textFile(s3InputPath)
.filter(_.split(",").length > 7)
.map(parseLine)
.filter(_._1.nonEmpty) // skips any un-parsable lines
// pick random samples
val samples1Ids = samplesRDD
.map(_._2._1) // map to id
.distinct
.takeSample(withReplacement = false, 100, 0)
// broadcast it to the cluster's nodes
val samples1IdsBC = sc broadcast samples1Ids
val samples1RDD = samplesRDD
.filter(samples1IdsBC.value contains _._2._1)
val samples2RDD = samplesRDD
.filter(sample => !samples1IdsBC.value.contains(sample._2._1))
// compute
samples1RDD
.cogroup(samples2RDD)
.flatMapValues { case (left, right) =>
left.map(sample1 => (sample1._1, right.filter(sample2 => isInRange(sample1._2, sample2._2)).map(_._1)))
}
.map {
case (timestamp, (sample1Id, sample2Ids)) =>
s"$timestamp,$sample1Id,${sample2Ids.mkString(";")}"
}
.repartition(10)
.saveAsTextFile(s3OutputPath)
UPDATE
Here is the same code using Dataframes:
// loading the samples
val samplesDF = spark
.read
.csv(inputPath)
.drop("_c1", "_c5", "_c6", "_c7", "_c8")
.toDF("id", "timestamp", "x", "y")
.withColumn("x", ($"x" / 100.0f).cast(sql.types.FloatType))
.withColumn("y", ($"y" / 100.0f).cast(sql.types.FloatType))
// pick random ids as samples 1
val samples1Ids = samplesDF
.select($"id") // map to the id
.distinct
.rdd
.takeSample(withReplacement = false, 1000)
.map(r => r.getAs[String]("id"))
// broadcast it to the executor
val samples1IdsBC = sc broadcast samples1Ids
// get samples 1 and 2
val samples1DF = samplesDF
.where($"id" isin (samples1IdsBC.value: _*))
val samples2DF = samplesDF
.where(!($"id" isin (samples1IdsBC.value: _*)))
samples2DF
.withColumn("combined", struct("id", "x", "y")) // columns are named x and y in samplesDF
.groupBy("timestamp")
.agg(collect_list("combined").as("combined_list"))
.join(samples1DF, Seq("timestamp"), "rightouter")
.map {
case Row(timestamp: String, samples: mutable.WrappedArray[GenericRowWithSchema], sample1Id: String, sample1X: Float, sample1Y: Float) =>
val sample2Info = samples.filter {
case Row(_, sample2X: Float, sample2Y: Float) =>
Misc.isInRange((sample2X, sample2Y), (sample1X, sample1Y), 20)
case _ => false
}.map {
case Row(sample2Id: String, sample2X: Float, sample2Y: Float) =>
s"$sample2Id:$sample2X:$sample2Y"
case _ => ""
}.mkString(";")
(timestamp, sample1Id, sample1X, sample1Y, sample2Info)
case Row(timestamp: String, _, sample1Id: String, sample1X: Float, sample1Y: Float) => // no overlapping samples
(timestamp, sample1Id, sample1X, sample1Y, "")
case _ =>
("error", "", 0.0f, 0.0f, "")
}
.where($"_1" notEqual "error")
// .show(1000, truncate = false)
.write
.csv(outputPath)
The issue here is that Spark normally commits tasks and jobs by renaming files, and on S3 renames are really, really slow. The more data you write, the longer the commit at the end of the job takes. That is what you are seeing.
Fix: switch to the S3A committers, which don't do any renames.
There are also some tuning options that massively increase the number of threads used for IO and commits, and the connection pool size:
fs.s3a.threads.max: raise from 10 to something bigger
fs.s3a.committer.threads: the number of files committed by a POST in parallel; default is 8
fs.s3a.connection.maximum: try (fs.s3a.committer.threads + fs.s3a.threads.max + 10)
The defaults are all fairly small because many jobs work with multiple buckets, and with big numbers for each it would be really expensive to create an S3A client. But if you have many thousands of files, raising them is probably worthwhile.
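A minimal sketch of how the settings above might be assembled in Scala. The helper name `s3aTuning` and the thread counts 64 and 32 are made up for illustration; the `spark.hadoop.` prefix is the usual way to hand Hadoop options to Spark, and `directory` is one of the S3A committer names (this assumes a Hadoop version that ships the S3A committers).

```scala
// Hypothetical helper: builds Spark settings for the S3A options discussed above.
// The connection pool size follows the rule of thumb
// fs.s3a.committer.threads + fs.s3a.threads.max + 10.
def s3aTuning(ioThreads: Int, committerThreads: Int): Map[String, String] = Map(
  "spark.hadoop.fs.s3a.committer.name"     -> "directory",
  "spark.hadoop.fs.s3a.threads.max"        -> ioThreads.toString,
  "spark.hadoop.fs.s3a.committer.threads"  -> committerThreads.toString,
  "spark.hadoop.fs.s3a.connection.maximum" -> (ioThreads + committerThreads + 10).toString
)

// Example values only; pick numbers that suit your cluster and file counts.
val settings = s3aTuning(ioThreads = 64, committerThreads = 32)

// Print the settings in the form accepted by spark-submit
settings.foreach { case (key, value) => println(s"--conf $key=$value") }
```

The same pairs can be passed to `SparkSession.builder().config(key, value)`. For DataFrame writes, Spark also has to be bound to the new committers through its cloud integration module; the exact properties depend on the Spark and Hadoop versions, so check the cloud integration guide for your release.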
So I currently have an akka stream to read a list of files, and a sink to concatenate them, and that works just fine:
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val sink = Sink.fold[ByteString, ByteString](ByteString(""))(_ ++ ByteString("\n") ++ _) // Concatenate
source.toMat(sink)(Keep.right).run().flatMap(concatByteStr => writeByteStrToFile(concatByteStr, "an-output-file.txt"))
While this is fine for a simple case, the files are rather large (on the order of GBs) and can't fit in the memory of the machine I'm running this application on. So I'd like to chunk the output once the byte string has reached a certain size. One option is Source.grouped(N), but the files vary greatly in size (from 1 KB to 2 GB), so grouping by element count gives no guarantee about the size of each chunk.
My question is whether there's a way to chunk the written files by the size of the byte string. The Akka Streams documentation is quite overwhelming and I'm having trouble figuring out the library. Any help would be greatly appreciated. Thanks!
The FileIO module from Akka Streams provides you with a streaming IO Sink to write to files, and utility methods to chunk a stream of ByteString. Your example would become something along the lines of
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))
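// Framing.delimiter drops the delimiter, so re-append the newline before writing
source.via(chunking).map(_ ++ ByteString("\n")).runWith(sink)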
Using the FileIO.toPath sink avoids holding the whole folded ByteString in memory (hence allowing proper streaming).
More details on this Akka module can be found in the docs.
I think @Stefano Bonetti already offered a great solution. I just wanted to add that one could also consider building a custom GraphStage to address specific chunking needs. In essence, create a chunk-emitting method like the one below for the In/Out handlers, as described in this Akka Stream link:
private def emitChunk(): Unit = {
if (buffer.isEmpty) {
if (isClosed(in)) completeStage()
else pull(in)
} else {
val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
buffer = nextBuffer
push(out, chunk)
}
}
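The buffering idea behind emitChunk can be tried out without Akka. The sketch below is not a GraphStage; it applies the same split-and-keep-the-rest step to a plain iterator of byte arrays, and unlike the snippet above it keeps pulling input until a full chunk is available. The name `rechunk` is hypothetical.

```scala
// Re-chunks arbitrarily sized byte arrays into arrays of exactly chunkSize bytes
// (only the last chunk may be shorter).
def rechunk(input: Iterator[Array[Byte]], chunkSize: Int): Iterator[Array[Byte]] = {
  require(chunkSize > 0, "chunkSize must be positive")
  new Iterator[Array[Byte]] {
    private var buffer = Array.emptyByteArray

    // Pull upstream elements until a full chunk is buffered or the input is exhausted
    private def fill(): Unit =
      while (buffer.length < chunkSize && input.hasNext) buffer = buffer ++ input.next()

    def hasNext: Boolean = { fill(); buffer.nonEmpty }

    def next(): Array[Byte] = {
      fill()
      if (buffer.isEmpty) throw new NoSuchElementException("no more chunks")
      // Same step as emitChunk: emit the head of the buffer, keep the remainder
      val (chunk, rest) = buffer.splitAt(chunkSize)
      buffer = rest
      chunk
    }
  }
}

// Three inputs of 5, 1 and 4 bytes, re-chunked into pieces of 4 bytes
val chunks = rechunk(Iterator("abcde", "f", "ghij").map(_.getBytes("UTF-8")), 4)
  .map(bytes => new String(bytes, "UTF-8"))
  .toList
println(chunks) // List(abcd, efgh, ij)
```

Inside a real GraphStage the loop in `fill` would be replaced by `pull(in)` calls from the handlers, since a stage must not block waiting for upstream.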
After a week of tinkering with the Akka Streams libraries, the solution I ended up with was a combination of Stefano's answer and a solution provided here. I read the source files line by line via the Framing.delimiter function, and then simply use the LogRotatorSink provided by Alpakka. The core of the rotation logic is here:
val fileSizeRotationFunction = () => {
val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
var size: Long = max
(element: ByteString) =>
{
if (size + element.size > max) {
val path = Files.createTempFile("out-", ".log")
size = element.size
Some(path)
} else {
size += element.size
None
}
}
}
val sizeRotatorSink: Sink[ByteString, Future[Done]] =
LogRotatorSink(fileSizeRotationFunction)
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
source.via(chunking).runWith(sizeRotatorSink)
And that's it. Hope this was helpful to others.
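To see how the rotation rule behaves, here is a small stand-alone sketch that applies the same decision to plain element sizes instead of ByteStrings. The function name `assignFiles` and the sample sizes are made up for illustration; it only mirrors the size bookkeeping, not the Alpakka sink itself.

```scala
// Same rule as the rotation function above: start a new file whenever adding
// the next element would push the current file past max.
// Returns, for each element, the index of the file it would be written to.
def assignFiles(elementSizes: Seq[Int], max: Long): Seq[Int] = {
  var size: Long = max // start "full" so that the first element opens file 0
  var fileIndex = -1
  elementSizes.map { n =>
    if (size + n > max) { fileIndex += 1; size = n } // rotate to a new file
    else size += n                                   // keep writing to the current file
    fileIndex
  }
}

val fileIndices = assignFiles(Seq(4, 4, 4, 10, 1), max = 10)
println(fileIndices) // List(0, 0, 1, 2, 3)
```

Note that a single element is never split, so an element larger than max still lands in one file of its own.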
I have two RDDs, namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)]. I'm using cogroup on those datasets like this:
val tab_c = tab_a.cogroup(tab_b).collect.toArray
val updated = tab_c.map { x =>
{
//somecode
}
}
I'm using the cogrouped values in tab_c in a map function. It works fine for small datasets, but for huge datasets it throws an out-of-memory exception.
I have tried converting the final value to an RDD, but no luck; I get the same error:
val newcos = spark.sparkContext.parallelize(tab_c)
1. How can I use cogroup on large datasets?
2. Can we persist the cogrouped value?
Code
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)
val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)
val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()
var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)
var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)
var srcs: String = Seq("")(0)
var destt: String = Seq("")(0)
val updated = cos.map { x =>
{
val key = x._1
val value = x._2
srcs = value._1.mkString(",")
destt = value._2.mkString(",")
if (srcs.equalsIgnoreCase(destt) == false && destt != "") {
srcmis :+= srcs
destmis :+= destt
}
if (srcs == "") {
extraindest :+= destt.mkString("")
}
if (destt == "") {
extrainsrc :+= srcs.mkString("")
}
}
}
Code Updated:
val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 != x._2._2)
// tab_c = {1,Compactbuffer(1,john,US),Compactbuffer(1,john,UK)}
//         {2,Compactbuffer(2,john,US),Compactbuffer(2,johnson,UK)}..
ERROR:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Thank you
When you use collect() you are basically telling Spark to move all the resulting data back to the driver (master) node, which can easily become a bottleneck. At that point you are no longer using Spark, just a plain array on a single machine.
To trigger the computation, use an action that works on the data where it lives on each node, for instance saveAsTextFile(). That is why executors sit on top of a distributed file system.
Here are some basic examples.
Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
TL;DR Don't collect.
To run this code safely without additional assumptions, every node (the driver and each executor) would need memory significantly exceeding the total memory required for all the data (on average, the requirements for worker nodes might be significantly smaller).
If you were to run it outside Spark you would need only one node. Therefore Spark provides no benefit here.
However, if you skip collect.toArray and make some assumptions about the data distribution, you might run it just fine.
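As a rough sketch of what skipping collect could look like for the comparison in the question: the per-key check becomes a pure function that runs on the executors, and the result is written out instead of being gathered in driver-side arrays. The names `diffLine` and `writeDiff`, the tab-separated output format, and the tags are assumptions for illustration, and the checks only loosely mirror the original loop.

```scala
import org.apache.spark.rdd.RDD

// Pure per-key comparison. Returns a tagged line for keys that differ
// between source and destination, and None for keys that match.
def diffLine(key: String, src: Iterable[String], dst: Iterable[String]): Option[String] = {
  val s = src.mkString(",")
  val d = dst.mkString(",")
  if (s.isEmpty) Some(s"extra_in_dest\t$key\t$d")            // key only in destination
  else if (d.isEmpty) Some(s"extra_in_src\t$key\t$s")        // key only in source
  else if (!s.equalsIgnoreCase(d)) Some(s"mismatch\t$key\t$s\t$d")
  else None
}

// Runs the comparison on the executors and writes the differences to storage.
// Nothing is collected to the driver.
def writeDiff(
    source: RDD[(String, String)],
    destination: RDD[(String, String)],
    outputPath: String): Unit =
  source
    .cogroup(destination)
    .flatMap { case (key, (src, dst)) => diffLine(key, src, dst).iterator }
    .saveAsTextFile(outputPath)
```

With the RDDs from the question this would be called as `writeDiff(source_primary_key, destination_primary_key, outputPath)`, where `outputPath` is whatever location you write to. Each group still has to fit in one executor's memory, so heavily skewed keys remain a concern.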
I am learning Akka Streams and I'm not sure I fully understand the performance difference between these two snippets when running on my laptop with 2 cores and 8 GB of RAM.
val f = Source(1 to numberOfFiles)
.mapAsyncUnordered(numberOfFiles) { _ =>
val fileName = UUID.randomUUID().toString
println(fileName)
Source(1 to numberOfCustomers).mapAsyncUnordered(numberOfCustomers){ _ =>
val rMsisdn = TestUtils.randomString(8)
Future(List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _))
}.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
}
.runForeach(io=> println(io.status))
and this one:
val f = Source(1 to numberOfFiles)
.mapAsyncUnordered(numberOfFiles) { _ =>
val fileName = UUID.randomUUID().toString
println(fileName)
Source(1 to numberOfCustomers).map{ _ =>
val rMsisdn = TestUtils.randomString(8)
List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _)
}.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
}
.runForeach(io=> println(io.status))
The second one provides better performance, and the difference becomes more and more significant under higher load (more files to write and more customers to generate).
My assumption is that the random generation is not very expensive, so parallelizing it with mapAsync costs more than just running it sequentially. Am I right?
What I don't understand is why the difference increases with the number of customers. The more customers I have, the bigger the gap between sequential and parallel generation.
Does it also come from the fact that I have a stream within a stream? Is it inefficient to nest two levels of parallelism?
Thanks for your explanation, and if you have any suggestions for tuning this code, don't hesitate!
Edit
New try with flatMapConcat as suggested, but I still have an issue with the filename (it doesn't compile). I don't know how to use the first element of the tuple as the filename of the sink.
val f = Source(1 to numberOfFiles)
.map{ i =>
val fileName = UUID.randomUUID().toString
println(fileName)
fileName
}
.flatMapConcat { f =>
Source(1 to numberOfCustomers).map{ p =>
val rMsisdn = TestUtils.randomString(8)
(f,List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _))
}
}
.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
I am trying to run the simplest program with Spark:
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (1 to 10000000).toList
val data = sc.parallelize(dat).cache()
for(i <- 1 to 100){
println(data.reduce(_ + _))
}
}
}
I get the following warning after each iteration:
WARN TaskSetManager: Stage 0 contains a task of very large size (9767 KB). The maximum recommended task size is 100 KB.
Increasing the data size increases said task size. This suggests to me that the driver is shipping the "dat" object to all executors, but I can't for the life of me see why, as the only operation on my RDD is reduce, which basically has no closure. Any ideas?
Because you create the very large list locally first, the Spark parallelize method tries to ship this list to the Spark workers as a single unit, as part of a task. Hence the warning message you receive. As an alternative, you could parallelize a much smaller list and then use flatMap to explode it into the larger list. This also has the benefit of creating the larger set of numbers in parallel. For example:
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest extends App {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (0 to 99).toList
val data = sc.parallelize(dat).cache().flatMap(i => (1 to 1000000).map(j => j * 100 + i))
println(data.count()) //100000000
println(data.reduce(_ + _))
sc.stop()
}
EDIT:
Ultimately the local collection being parallelized has to be pushed to the executors. The parallelize method creates an instance of ParallelCollectionRDD:
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L730
ParallelCollectionRDD creates a number of partitions equal to numSlices:
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L96
numSlices defaults to sc.defaultParallelism which on my machine is 4. So even when split, each partition contains a very large list which needs to be pushed to an executor.
SparkContext.parallelize contains the note "@note Parallelize acts lazily", and ParallelCollectionRDD contains the comment:
// TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
// cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
// instead.
So it appears that the problem happens when you call reduce because this is the point that the partitions are sent to the executors, but the root cause is that you are calling parallelize on a very big list. Generating the large list within the executors is a better approach, IMHO.
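For plain number sequences like the one in the question, another way to generate the data within the executors is `SparkContext.range`. A minimal sketch, assuming a local master as in the question; the helper name `sumOfRange` is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sums 1..n without building the numbers on the driver first.
// sc.range only needs the bounds and the step to describe the data, so the
// elements are produced inside the tasks. The end bound is exclusive.
def sumOfRange(sc: SparkContext, n: Long): Long =
  sc.range(1L, n + 1L).reduce(_ + _)

val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sumOfRange(sc, 10000000L)) // elements are Long, so the sum does not overflow
sc.stop()
```

Because `sc.range` yields an `RDD[Long]`, this also sidesteps the Int overflow that summing ten million Ints with `reduce(_ + _)` would run into.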
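reduce combines the elements within each partition on the executors and then merges the per-partition results on the driver. When you run sc.parallelize without specifying numSlices, the data is split into sc.defaultParallelism partitions. To make use of the already distributed data, use something like this: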
data.map(el=> el%100 -> el).reduceByKey(_+_)
or you can do the reduce at partition level.
data.mapPartitions(p => Iterator(p.reduce(_ + _))).reduce(_ + _)
or just use sum :)
I start with an RDD that represents the following data structure: user_id, product_id, bought or not, as an RDD[(String, String, Int)].
To compute stats such as how many products each user has bought, I wrote the following method:
def userProductAggregation(rdd: org.apache.spark.rdd.RDD[(String, String, Int)]): RDD[(String, Long)] = {
val productPerUserRDD = rdd.filter(_._3 == 1)
.map { case (u, p, _) => (p, u) }
.distinct(numPartitions = 5000)
.map { case (p, _) => (p, 1L) }
.reduceByKey(_ + _, numPartitions = 5000)
return productPerUserRDD
}
The problem is that I get a Java heap issue when I run this. My total input size is close to 500 GB. In standalone mode I launch Spark with --driver-cores 8G --executor-memory 16G --total-executor-cores 80. I think this should be plenty for this job. Is there a better way to write this method? I thought my approach was very efficient, but I am starting to question that. I have also tried increasing the number of partitions to 8000, but the issue stayed the same.