I am learning Akka Streams and I am not sure I fully understand the performance difference between these two code snippets when running on my laptop with 2 cores and 8 GB of RAM.
val f = Source(1 to numberOfFiles)
.mapAsyncUnordered(numberOfFiles) { _ =>
val fileName = UUID.randomUUID().toString
println(fileName)
Source(1 to numberOfCustomers).mapAsyncUnordered(numberOfCustomers){ _ =>
val rMsisdn = TestUtils.randomString(8)
Future(List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _))
}.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
}
.runForeach(io=> println(io.status))
and this one:
val f = Source(1 to numberOfFiles)
.mapAsyncUnordered(numberOfFiles) { _ =>
val fileName = UUID.randomUUID().toString
println(fileName)
Source(1 to numberOfCustomers).map{ _ =>
val rMsisdn = TestUtils.randomString(8)
List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _)
}.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
}
.runForeach(io=> println(io.status))
The second one performs better, and the difference becomes more and more significant as the load increases (more files to write and more customers to generate).
My assumption is that the random generation is not very expensive, so parallelizing it with mapAsync costs more than just running it sequentially. Am I right?
What I don't understand is why the difference grows with the number of customers. The more customers I have, the bigger the gap between sequential and parallel generation.
Does it also come from the fact that I have a stream inside a stream? Is it inefficient to have two levels of parallelism, one inside the other?
Thanks for your explanation, and if you have any suggestions to tune this code, don't hesitate!
Edit
New try with flatMapConcat as suggested, but I still have an issue with the filename (it doesn't compile). I don't know how to use the first element of the tuple as the filename of the sink.
val f = Source(1 to numberOfFiles)
.map{ i =>
val fileName = UUID.randomUUID().toString
println(fileName)
fileName
}
.flatMapConcat { f =>
Source(1 to numberOfCustomers).map{ p =>
val rMsisdn = TestUtils.randomString(8)
(f,List(1 to Random.nextInt(20)).map{ i=>
val rCdr= RandomCdr(rMsisdn)
ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
}.fold(ByteString())(_ concat _))
}
}
.runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$fileName")))
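Not part of the original post, but for illustration, here is one way the tuple's first element could drive the sink: group the (fileName, ByteString) stream by its first element and run one FileIO sink per substream. This is only a sketch; it reuses the imports and helpers from the question, the parallelism of 4 is arbitrary, and the fold accumulates each file's content in memory until the upstream completes, so it only suits modest data volumes.
val f = Source(1 to numberOfFiles)
  .map(_ => UUID.randomUUID().toString)
  .flatMapConcat { fileName =>
    Source(1 to numberOfCustomers).map { _ =>
      val rMsisdn = TestUtils.randomString(8)
      val lines = (1 to Random.nextInt(20)).map { _ =>
        val rCdr = RandomCdr(rMsisdn)
        ByteString(s"${rCdr.msisdn};${rCdr.dateTime};${rCdr.peer};${rCdr.callType};${rCdr.way};${rCdr.duration}\n")
      }.fold(ByteString())(_ concat _)
      (fileName, lines)                                    // keep the filename with every chunk
    }
  }
  .groupBy(numberOfFiles, _._1)                            // one substream per filename
  .fold(("", ByteString.empty)) { case ((_, acc), (name, bytes)) => (name, acc concat bytes) }
  .mapAsyncUnordered(4) { case (name, bytes) =>            // write each accumulated file
    Source.single(bytes).runWith(FileIO.toPath(Paths.get(s"/home/reactive/data/$name")))
  }
  .mergeSubstreams
  .runForeach(io => println(io.status))
If buffering whole files is not acceptable, the simpler option is to keep the FileIO sink inside the stage where fileName is still in scope, as the two original snippets do.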
Related
I am struggling to understand whether akka-stream enforces backpressure on a Source when the graph contains a broadcast with one branch that takes a lot of time (asynchronously).
I tried buffer and batch to see if any backpressure was applied to the source, but it does not look like it. I also tried flushing System.out, but it does not change anything.
object Test extends App {
/* Necessary for akka stream */
implicit val system = ActorSystem("test")
implicit val materializer: ActorMaterializer = ActorMaterializer()
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val in = Source.tick(0 seconds, 1 seconds, 1)
in.runForeach(i => println("Produced " + i))
val out = Sink.foreach(println)
val out2 = Sink.foreach[Int]{ o => println(s"2 $o") }
val bcast = builder.add(Broadcast[Int](2))
val batchedIn: Source[Int, Cancellable] = in.batch(4, identity) {
case (s, v) => println(s"Batched ${s+v}"); s + v
}
val f2 = Flow[Int].map(_ + 10)
val f4 = Flow[Int].map { i => Thread.sleep(2000); i}
batchedIn ~> bcast ~> f2 ~> out
bcast ~> f4.async ~> out2
ClosedShape
})
g.run()
}
I would expect to see "Batched ..." in the console while the program runs, and at some point to see it momentarily stuck because f4 is not fast enough to process the values. At the moment, neither of those things happens as expected: the numbers are generated continuously and no batching is done.
EDIT: I noticed that after some time, the batch messages start to print out in the console. I still don't know why it does not happen sooner, since the backpressure should apply to the very first elements.
The reason for this behavior is the internal buffers that Akka introduces when async boundaries are set. See the documentation section "Buffers for asynchronous operators", which describes the internal buffers that are introduced as an optimization when using asynchronous operators:
While pipelining in general increases throughput, in practice there is a cost of passing an element through the asynchronous (and therefore thread crossing) boundary which is significant. To amortize this cost Akka Streams uses a windowed, batching backpressure strategy internally. It is windowed because as opposed to a Stop-And-Wait protocol multiple elements might be “in-flight” concurrently with requests for elements. It is also batching because a new element is not immediately requested once an element has been drained from the window-buffer but multiple elements are requested after multiple elements have been drained. This batching strategy reduces the communication cost of propagating the backpressure signal through the asynchronous boundary.
I understand that this is a toy stream, but if you explain what your goal is I will try to help you.
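Not from the original answer, but as an illustration of the buffer explanation above: if you shrink the input buffer, the backpressure crosses the async boundary much sooner and the "Batched ..." messages show up almost immediately. A minimal sketch, assuming the ActorSystem from the question and a buffer size of 1 chosen purely for illustration:
import akka.actor.ActorSystem
import akka.stream.{ ActorMaterializer, ActorMaterializerSettings }

implicit val system = ActorSystem("test")

// Same materializer as in the question, but with a tiny input buffer (initial = 1, max = 1),
// so the async boundary can no longer pre-fetch a whole window of elements.
implicit val materializer: ActorMaterializer = ActorMaterializer(
  ActorMaterializerSettings(system).withInputBuffer(1, 1)
)
The same effect can be obtained per stage with Attributes.inputBuffer, or globally via the akka.stream.materializer.max-input-buffer-size configuration setting.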
You need mapAsync instead of async
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import akka.stream.scaladsl.GraphDSL.Implicits._
val in = Source.tick(0 seconds, 1 seconds, 1).map(x => {println(s"Produced ${x}"); x})
val out = Sink.foreach[Int]{ o => println(s"F2 processed $o") }
val out2 = Sink.foreach[Int]{ o => println(s"F4 processed $o") }
val bcast = builder.add(Broadcast[Int](2))
val batchedIn: Source[Int, Cancellable] = in.buffer(4,OverflowStrategy.backpressure)
val f2 = Flow[Int].map(_ + 10)
val f4 = Flow[Int].mapAsync(1) { i => Future { println("F4 Started Processing"); Thread.sleep(2000); i }(system.dispatcher) }
batchedIn ~> bcast ~> f2 ~> out
bcast ~> f4 ~> out2
ClosedShape
}).run()
I'm developing a Spark application with Scala. My application consists of only one operation that requires shuffling (namely cogroup). It runs flawlessly and in a reasonable time. The issue I'm facing is when I want to write the results back to the file system; for some reason it takes longer than running the actual computation. At first I tried writing the results without re-partitioning or coalescing, and I realized that the number of generated files is huge, so I thought that was the issue. I then tried re-partitioning (and coalescing) before writing, but the application took a long time performing these tasks. I know that re-partitioning (and coalescing) is costly, but is what I'm doing the right way? If it's not, could you please give me hints on the right approach?
Notes:
My file system is Amazon S3.
My input data size is around 130GB.
My cluster contains a driver node and five slave nodes, each with 16 cores and 64 GB of RAM.
I'm assigning 15 executors to my job, each with 5 cores and 19 GB of RAM.
P.S. I tried using DataFrames; same issue.
Here is a sample of my code just in case:
val sc = spark.sparkContext
// loading the samples
val samplesRDD = sc
.textFile(s3InputPath)
.filter(_.split(",").length > 7)
.map(parseLine)
.filter(_._1.nonEmpty) // skips any un-parsable lines
// pick random samples
val samples1Ids = samplesRDD
.map(_._2._1) // map to id
.distinct
.takeSample(withReplacement = false, 100, 0)
// broadcast it to the cluster's nodes
val samples1IdsBC = sc broadcast samples1Ids
val samples1RDD = samplesRDD
.filter(samples1IdsBC.value contains _._2._1)
val samples2RDD = samplesRDD
.filter(sample => !samples1IdsBC.value.contains(sample._2._1))
// compute
samples1RDD
.cogroup(samples2RDD)
.flatMapValues { case (left, right) =>
left.map(sample1 => (sample1._1, right.filter(sample2 => isInRange(sample1._2, sample2._2)).map(_._1)))
}
.map {
case (timestamp, (sample1Id, sample2Ids)) =>
s"$timestamp,$sample1Id,${sample2Ids.mkString(";")}"
}
.repartition(10)
.saveAsTextFile(s3OutputPath)
UPDATE
Here is the same code using Dataframes:
// loading the samples
val samplesDF = spark
.read
.csv(inputPath)
.drop("_c1", "_c5", "_c6", "_c7", "_c8")
.toDF("id", "timestamp", "x", "y")
.withColumn("x", ($"x" / 100.0f).cast(sql.types.FloatType))
.withColumn("y", ($"y" / 100.0f).cast(sql.types.FloatType))
// pick random ids as samples 1
val samples1Ids = samplesDF
.select($"id") // map to the id
.distinct
.rdd
.takeSample(withReplacement = false, 1000)
.map(r => r.getAs[String]("id"))
// broadcast it to the executor
val samples1IdsBC = sc broadcast samples1Ids
// get samples 1 and 2
val samples1DF = samplesDF
.where($"id" isin (samples1IdsBC.value: _*))
val samples2DF = samplesDF
.where(!($"id" isin (samples1IdsBC.value: _*)))
samples2DF
.withColumn("combined", struct("id", "lng", "lat"))
.groupBy("timestamp")
.agg(collect_list("combined").as("combined_list"))
.join(samples1DF, Seq("timestamp"), "rightouter")
.map {
case Row(timestamp: String, samples: mutable.WrappedArray[GenericRowWithSchema], sample1Id: String, sample1X: Float, sample1Y: Float) =>
val sample2Info = samples.filter {
case Row(_, sample2X: Float, sample2Y: Float) =>
Misc.isInRange((sample2X, sample2Y), (sample1X, sample1Y), 20)
case _ => false
}.map {
case Row(sample2Id: String, sample2X: Float, sample2Y: Float) =>
s"$sample2Id:$sample2X:$sample2Y"
case _ => ""
}.mkString(";")
(timestamp, sample1Id, sample1X, sample1Y, sample2Info)
case Row(timestamp: String, _, sample1Id: String, sample1X: Float, sample1Y: Float) => // no overlapping samples
(timestamp, sample1Id, sample1X, sample1Y, "")
case _ =>
("error", "", 0.0f, 0.0f, "")
}
.where($"_1" notEqual "error")
// .show(1000, truncate = false)
.write
.csv(outputPath)
The issue here is that Spark normally commits tasks and jobs by renaming files, and on S3 renames are really, really slow. The more data you write, the longer it takes at the end of the job. That is what you are seeing.
Fix: switch to the S3A committers, which don't do any renames.
Some tuning options to massively increase the number of threads for IO and commits, and the connection pool size:
fs.s3a.threads.max: raise from the default of 10 to something bigger
fs.s3a.committer.threads: number of files committed by a POST in parallel; default is 8
fs.s3a.connection.maximum: try (fs.s3a.committer.threads + fs.s3a.threads.max + 10)
These defaults are all fairly small because many jobs work with multiple buckets, and if each had big numbers it would be really expensive to create an S3A client... but if you have many thousands of files, it is probably worthwhile.
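A rough sketch of where these settings would go (my addition, not part of the answer). The numbers are placeholders rather than recommendations, and fully wiring up the S3A committers depends on your Spark/Hadoop versions and the committer bindings on the classpath:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cogroup-to-s3")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")   // an S3A committer instead of rename-based commits
  .config("spark.hadoop.fs.s3a.threads.max", "64")             // default is 10
  .config("spark.hadoop.fs.s3a.committer.threads", "32")       // default is 8
  .config("spark.hadoop.fs.s3a.connection.maximum", "128")     // >= committer.threads + threads.max + 10
  .getOrCreate()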
Is there any Spark function that allows splitting a collection into several RDDs according to some criteria? Such a function would make it possible to avoid excessive iteration. For example:
def main(args: Array[String]) {
val logFile = "file.txt"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}
In this example I have to iterate over `logData` twice just to write the results to two separate files:
val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
It would be nice instead to have something like this:
val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
resultMap.writeByKey("a", "linesA.txt")
resultMap.writeByKey("b", "linesB.txt")
Any such thing?
Maybe something like this would work:
def singlePassMultiFilter[T](
rdd: RDD[T],
f1: T => Boolean,
f2: T => Boolean,
level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
val tempRDD = rdd mapPartitions { iter =>
val abuf1 = ArrayBuffer.empty[T]
val abuf2 = ArrayBuffer.empty[T]
for (x <- iter) {
if (f1(x)) abuf1 += x
if (f2(x)) abuf2 += x
}
Iterator.single((abuf1, abuf2))
}
tempRDD.persist(level)
val rdd1 = tempRDD.flatMap(_._1)
val rdd2 = tempRDD.flatMap(_._2)
(rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}
Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1), since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, pretty negligible.
You would use singlePassMultiFilter like so:
val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist() //I'm going to need `rdd1` more later...
println(rdd1.count)
println(rdd2.count)
cleanUp(true) //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
println(rdd1.distinct.count)
Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
Have a look at the following question.
Write to multiple outputs by key Spark - one Spark job
You can flatMap an RDD with a function like the following and then do a groupBy on the key.
def multiFilter(words:List[String], line:String) = for { word <- words; if line.contains(word) } yield { (word,line) }
val filterWords = List("a","b")
val filteredRDD = logData.flatMap( line => multiFilter(filterWords, line) )
val groupedRDD = filteredRDD.groupBy(_._1)
But depending on the size of your input RDD, you may or may not see any performance gain, because any groupBy operation involves a shuffle.
On the other hand, if you have enough memory in your Spark cluster, you can cache the input RDD, and then running multiple filter operations may not be as expensive as you think; see the sketch below.
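For illustration (not in the original answer), here is that caching idea spelled out with the logData RDD from the question, which is already marked with .cache():
// The first action materializes the RDD and fills the cache; the second filter pass is
// then served from the cached partitions instead of re-reading the file.
logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
logData.unpersist()   // free the cached partitions once both outputs are written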
I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome+"/ratings.csv").map { line=>
val fields = line.split(",")
Rating(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way?
I am trying this but it fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here: http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code:
def split[T : ClassManifest](data: RDD[T], p: Double, seed: Long =
System.currentTimeMillis): (RDD[T], RDD[T]) = {
val rand = new java.util.Random(seed)
val partitionSeeds = data.partitions.map(partition => rand.nextLong)
val temp = data.mapPartitionsWithIndex((index, iter) => {
val partitionRand = new java.util.Random(partitionSeeds(index))
iter.map(x => (x, partitionRand.nextDouble))
})
(temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
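For completeness, a usage sketch (my addition) with the ratings RDD from the earlier snippet, keeping roughly 80% for training:
// split comes from the helper above; the seed defaults to the current time.
val (train, test) = split(ratings, 0.8)
println(s"train: ${train.count()}, test: ${test.count()}")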
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
(rdd, prob) => rdd.filter { case (k, _) => prob(k) }.map { case (_, v) => v }
val ranRating = rating.map( x=> (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can be performed from that point without incurring the execution of the complete lineage.
(*) Note the use of val to define the function instead of def; vals are serializer-friendly.
I am using Calliope, i.e. the Spark plugin to connect with Cassandra. I have created 2 RDDs which look like
class A
val persistLevel = org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
val cas1 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 1")
val sc1 = new SparkContext("local", "name it any thing ")
var rdd1 = sc.cql3Cassandra[SCALACLASS_1](cas1)
var rddResult1 = rdd1.persist(persistLevel)
class B
val cas2 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 2")
var rdd2 = sc1.cql3Cassandra[SCALACLASS_2](cas2)
var rddResult2 = rdd2.persist(persistLevel)
Somehow the following code, which creates a new RDD using the other two, is not working. Is it possible that we cannot iterate over two RDDs together?
Here is the code snippet which is not working:
case class Report(id: Long, anotherId: Long)
var reportRDD = rddResult2.flatMap(f => {
val buf = List[Report]()
rddResult1.collect().toList.foldLeft(buf)((k, v) => {
val buf1 = new ListBuffer[Report]
buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
})
})
})
while if I replace rddResult1.collect().toList with a val initialized outside the closure, like this:
val collection = rddResult1.collect().toList
var reportRDD = rddResult2.flatMap(f => {
val buf = List[Report]()
collection.foldLeft(buf)((k, v) => {
val buf1 = new ListBuffer[Report]
buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
})
})
})
it works. Is there any explanation?
You are mixing a transformation with an action. The closure of rdd2.flatMap is executed on the workers, while rdd1.collect is an 'action' in Spark lingo and delivers data back to the driver. So, informally, you could say that the data is not there when you try to flatMap over it. (I don't know enough of the internals yet to pinpoint the exact root cause.)
If you want to operate on both RDDs in a distributed way, you should join them using one of the join functions (join, leftOuterJoin, rightOuterJoin, cogroup).
E.g.
val mappedRdd1 = rdd1.map(x=> (x.id,x))
val mappedRdd2 = rdd2.map(x=> (x.customerId, x))
val joined = mappedRdd1.join(mappedRdd2)
joined.flatMap(...reporting logic..).collect
You can operate on RDDs in the driver application, but you cannot operate on RDDs in the executors (the worker nodes): the executors cannot give commands to drive the cluster, and the code inside flatMap runs on the executors.
In the first case, you try to operate on an RDD in the executor. I reckon you would get a NotSerializableException as you cannot even send the RDD object to the executors. In the second case, you pull the RDD contents to the application, and then send this simple List to the executors. (Lambda captures are automatically serialized.)
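Not part of the original answer, but a small sketch of that second, working approach with an explicit broadcast, using the placeholder names from the question. It assumes rddResult1 is small enough to collect on the driver and that INSTANCE_VAR_FROM_SCALACLASS_1 is a collection, as in the question's foldLeft; broadcasting avoids re-shipping the list with every task:
// Collect the small RDD once on the driver and broadcast it to the executors.
val collected = rddResult1.collect().toList
val collectedBC = sc1.broadcast(collected)

val reportRDD = rddResult2.flatMap { f =>
  collectedBC.value.flatMap { v =>
    v.INSTANCE_VAR_FROM_SCALACLASS_1.map { iv =>
      Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    }
  }
}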