I'm writing a Scala program to read objects matching a certain prefix on S3.
At the moment, I'm testing it on my MacBook Pro, and it takes 270 ms (averaged over 1000 trials) to hit S3, retrieve the 10 objects (average object size 150 KB), and process them to print the output.
Here's my code:
val myBucket = "my-test-bucket"
val myPrefix = "t"
val startTime = System.currentTimeMillis()
//Can I make listObject parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest).getObjectSummaries.par.toIndexedSeq.map(_.getKey).filter(_ matches s"./.*${myPrefix}.*/*")
//Can I make forEach parallel?
listObjResult foreach println //Could be any function
println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the grand scheme of things, I have to sift through 50 GB of data (approx. 350K nested objects) and delete the objects matching a certain prefix (approx. 40K objects).
Hardware considerations aside, what can I do to optimize my code?
Thanks!
A possible solution is to batch the keys and send each batch to S3 as a single batch-deletion request. You can group the objects to delete and then issue the batch requests concurrently, each in its own Future:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import scala.collection.JavaConverters._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.util.Try

object AmazonBatchDeletion {
  def main(args: Array[String]): Unit = {
    val filesToDelete: List[String] = ???
    // `grouped` takes the size of each batch, not the number of batches.
    // S3 accepts at most 1000 keys per DeleteObjects request.
    val batchSize: Int = 1000

    val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
      filesToDelete
        .grouped(batchSize)
        .map(groupToDelete => Future {
          blocking {
            deleteFilesInBatch(groupToDelete, "bucketName")
          }
        })

    val result: Future[Iterator[Try[DeleteObjectsResult]]] =
      Future.sequence(deletionAttempts)
    // TODO: make sure deletion was successful.
    // Recover if needed from faulted futures.
    // Await (or otherwise consume) `result` before main returns:
    // the global ExecutionContext runs on daemon threads.
  }

  def deleteFilesInBatch(filesToDelete: List[String],
                         bucketName: String): Try[DeleteObjectsResult] = {
    val amazonClient = new AmazonS3Client()
    val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
    deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
    Try {
      amazonClient.deleteObjects(deleteObjectsRequest)
    }
  }
}
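One way to approach the TODO in the code above is to keep each batch next to its outcome, so that failed batches can be reported or retried. The following is only a sketch with no S3 dependency: deleteBatch is a hypothetical stand-in for the real deleteObjects call (it fails for any batch containing the key "bad"), and the names BatchOutcomeSketch and deleteAll are made up for illustration.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object BatchOutcomeSketch {
  // Hypothetical stand-in for the S3 call: returns the number of keys "deleted"
  // and fails for any batch that contains the key "bad".
  def deleteBatch(keys: List[String]): Try[Int] =
    Try {
      if (keys.contains("bad")) throw new RuntimeException("simulated S3 error")
      keys.size
    }

  // Runs all batches concurrently and returns
  // (number of keys deleted, keys belonging to failed batches).
  def deleteAll(keys: List[String], batchSize: Int): (Int, List[String]) = {
    val attempts: List[Future[(List[String], Try[Int])]] =
      keys.grouped(batchSize).toList.map { batch =>
        Future(blocking(batch -> deleteBatch(batch)))
      }
    // Block until every batch has finished, successfully or not.
    val outcomes = Await.result(Future.sequence(attempts), 1.minute)
    val deleted = outcomes.collect { case (_, Success(n)) => n }.sum
    val failedKeys = outcomes.collect { case (batch, Failure(_)) => batch }.flatten
    (deleted, failedKeys)
  }
}
```

The keys returned as failed could then be fed back into deleteAll for a retry, or logged for manual inspection.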
I have multiple independent files that need to be processed by Spark. How can I load them into separate RDDs in parallel? Thanks!
The coding language is Scala.
If you want concurrent reading/processing of RDDs, you could leverage scala.concurrent.Future (or an effect type from ZIO, Cats Effect, etc.).
Sample code for a loading function is below:
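import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.concurrent.{ExecutionContext, Future}

def load(paths: Seq[String], spark: SparkSession)
        (implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
  // textFile is lazy: it only defines the RDD; the file is read when an action runs.
  def loadSinglePath(path: String): Future[RDD[String]] = Future {
    spark.sparkContext.textFile(path)
  }
  paths map loadSinglePath
}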
Sample code for using this function:
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec: ExecutionContext = ExecutionContext.global
val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), spark).zipWithIndex
  .map { case (rddFuture, idx) =>
    rddFuture.map( rdd =>
      // count() is the action that actually triggers the read
      println(s"Rdd with index $idx has ${rdd.count()}")
    )
  }
Await.result(Future.sequence(result), 1.hour)
For example purposes, the default global ExecutionContext is used here, but you can run your code on a custom one instead: just replace this implicit val ec with your own ExecutionContext.
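As a rough illustration of the custom ExecutionContext mentioned above, the sketch below builds one from a fixed-size thread pool and shuts it down afterwards. It uses only the standard library; runOnPool and CustomPoolSketch are made-up names, and work is a hypothetical stand-in for "load the RDD and run an action on it".

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object CustomPoolSketch {
  // Runs `work` for every path on a dedicated pool of `threads` threads.
  def runOnPool[A](paths: Seq[String], threads: Int)(work: String => A): Seq[A] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(pool)
    try {
      // At most `threads` of these futures execute at the same time.
      val futures = paths.map(path => Future(work(path)))
      Await.result(Future.sequence(futures), 1.hour)
    } finally {
      ec.shutdown() // otherwise the non-daemon pool threads keep the JVM alive
    }
  }
}
```

The pool size caps how many jobs are submitted at once, which is worth tuning against what the cluster can actually run concurrently.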
My objective is to run a number of Spark ML regression models (thousands of times) on one dataset, and I want to do this using ZIO instead of Future because the current version runs too slowly. Below is a working example using Future.
A distinct list of keys is used to filter the partitioned dataset by key, and the model is run on each subset. I've set up a thread pool with 8 threads to manage it, but performance quickly degrades.
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import org.apache.spark.sql.SaveMode
val pool = Executors.newFixedThreadPool(8)
implicit val xc: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(pool)
case class Result(key: String, coeffs: String)
try {
import spark.implicits._
val tasks = {
for (x <- keys)
yield Future {
Seq(
Result(
x.group,
runModel(input.filter(col("group")===x)).mkString(",")
)
).toDS()
.write.mode(SaveMode.Overwrite).option("header", false).csv(
s"hdfs://namenode:8020/results/$x.csv"
)
}
}.toSeq
Await.result(Future.sequence(tasks), Duration.Inf)
}
finally {
pool.shutdown()
pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
I've tried to implement this in ZIO, but I don't know how to implement queues and limit the number of concurrent tasks the way I can with Futures.
Below is my failed attempt so far...
import zio._
import zio.console._
import zio.stm._
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
//example data/signatures
case class ModelResult(key: String, coeffs: String)
case class Data(key: String, sales: Double)
val keys: Array[String] = Array("100_1", "100_2")
def runModel[T](ds: Dataset[T]): Vector[Double]
object MyApp1 extends App {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
val input: Dataset[Data] = Seq(Data("100_1", 1d), Data("100_2", 2d)).toDS
def run(args: List[String]): ZIO[ZEnv, Nothing, Int] = {
for {
queue <- Queue.bounded[Int](8)
_ <- ZIO.foreach(1 to 8) (i => queue.offer(i)).fork
_ <- ZIO.foreach(keys) { k => queue.take.flatMap(_ => readWrite(k, input, queue)) }
} yield 0
}
def writecsv(k: String, v: String) = {
Seq(ModelResult(k, v))
.toDS
.write
.mode(SaveMode.Overwrite).option("header", value = false)
.csv(s"hdfs://namenode:8020/results/$k.csv")
}
def readWrite[T](key: String, ds: Dataset[T], queue: Queue[Int]): ZIO[ZEnv, Nothing, Int] = {
(for {
result <- runModel(ds.filter(col("key")===key)).mkString(",")
_ <- writecsv(key, result)
_ <- queue.offer(1)
_ <- putStrLn(s"successfully wrote output for $key")
} yield 0)
}
}
//to run
MyApp1.run(List[String]())
What is the best way to compute this in ZIO?
To parallelize some workload across, say, 8 threads, all you need is:
ZIO.foreachParN(8)(1 to 100)(id => zio.blocking.blocking(Task{yourClusterJob(id)}))
But don't expect much of a boost from switching from Futures to ZIO here:
1) The actual workload dominates the coordination overhead, so the difference between ZIO and Future should be marginal.
2) You may not get any boost at all, because the 8 tasks will be competing for the same resource pool in the Spark cluster.
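Applied to the keys from the question, the one-liner above could be fleshed out roughly as follows. This is a sketch that assumes ZIO 1.0.x (foreachParN is not available under that name in ZIO 2); fitAndWrite is a hypothetical stand-in for the runModel and writecsv steps, and the object name is made up.

```scala
import zio._
import zio.blocking.{Blocking, effectBlocking}

object BoundedParallelismSketch extends zio.App {
  val keys: List[String] = List("100_1", "100_2", "100_3")

  // Hypothetical stand-in for "filter by key, fit the model, write the CSV".
  def fitAndWrite(key: String): String = s"coefficients for $key"

  // At most 8 keys are processed at once; effectBlocking runs each job
  // on ZIO's blocking thread pool. Results come back in input order.
  val program: ZIO[Blocking, Throwable, List[String]] =
    ZIO.foreachParN(8)(keys)(key => effectBlocking(fitAndWrite(key)))

  def run(args: List[String]): URIO[ZEnv, ExitCode] =
    program.flatMap(results => console.putStrLn(results.mkString("\n"))).exitCode
}
```

No explicit queue is needed: the parallelism limit passed to foreachParN plays the role of the bounded queue of tokens in the attempt above.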
I am trying to understand how I should be working with Source.queue and Sink.queue in Akka Streams.
In the little test program below, I am able to successfully offer 1000 items to the Source.queue.
However, when I wait on the future that should give me the results of pulling all those items off the queue, the
future never completes. Specifically, the message 'print what we pulled off the queue' that we should see at the end
never prints; instead we see the error "TimeoutException: Futures timed out after [10 seconds]".
Any guidance is greatly appreciated!
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes}
import org.scalatest.FunSuite
import scala.collection.immutable
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class StreamSpec extends FunSuite {
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
implicit val ec: ExecutionContext = actorSystem.dispatcher
case class Req(name: String)
case class Response(
httpVersion: String = "",
method: String = "",
url: String = "",
headers: Map[String, String] = Map())
test("put items on queue then take them off") {
val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes( Attributes.inputBuffer(128, 128))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
(1 to 1000).map( i =>
Future {
println("offerd" + i) // I see this print 1000 times as expected
sourceQueue.offer(s"batch-$i")
}
)
println("DONE OFFER FUTURE FIRING")
// Now use the Sink.queue to pull the items we added onto the Source.queue
val seqOfFutures: immutable.Seq[Future[Option[String]]] =
(1 to 1000).map{ i => sinkQueue.pull() }
val futureOfSeq: Future[immutable.Seq[Option[String]]] =
Future.sequence(seqOfFutures)
val seq: immutable.Seq[Option[String]] =
Await.result(futureOfSeq, 10.second)
// unfortunately our future times out here
println("print what we pulled off the queue:" + seq);
}
}
Looking at this again, I realize that I originally set up and posed my question incorrectly.
The test that accompanies my original question launches a wave
of 1000 futures, each of which tries to offer 1 item to the queue.
Then the second step in that test attempts to create a 1000-element sequence (seqOfFutures)
where each future is trying to pull a value from the queue.
My theory as to why I was getting time-out errors is that there was some kind of deadlock due to running
out of threads or due to one thread waiting on another but where the waited-on-thread was blocked,
or something like that.
I'm not interested in hunting down the exact cause at this point because I have corrected
things in the code below (see CORRECTED CODE).
In the new code the test that uses the queue is called:
"put items on queue then take them off (with async parallelism) - (3)".
In this test I have a set of 10 tasks which run in parallel to do the 'enqueue' operation.
Then I have another 10 tasks which do the dequeue operation, which involves not only taking
the item off the list, but also calling stringModifyFunc which introduces a 1 ms processing delay.
I also wanted to prove that I got some performance benefit from
launching tasks in parallel and having the task steps communicate by passing their results through a
queue, so test 3 runs as a timed operation, and I found that it takes 1.9 seconds.
Tests (1) and (2) do the same amount of work, but serially: the first with no intervening queue, and the second
using the queue to pass results between steps. These tests run in 13.6 and 15.6 seconds respectively
(which shows that the queue adds a bit of overhead, but that this is outweighed by the efficiency of running tasks in parallel).
CORRECTED CODE
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes, QueueOfferResult}
import org.scalatest.FunSuite
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class Speco extends FunSuite {
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
implicit val ec: ExecutionContext = actorSystem.dispatcher
val stringModifyFunc: String => String = element => {
Thread.sleep(1)
s"Modified $element"
}
def setup = {
val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(128, 128))
val (sourceQueue, sinkQueue) = source.toMat(sink)(Keep.both).run()
val offers: Source[String, NotUsed] = Source(
(1 to iterations).map { i =>
s"item-$i"
}
)
(sourceQueue,sinkQueue,offers)
}
val outer = 10
val inner = 1000
val iterations = outer * inner
def timedOperation[T](block : => T) = {
val t0 = System.nanoTime()
val result: T = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) / (1000 * 1000) + " milliseconds")
result
}
test("20k iterations in single threaded loop no queue (1)") {
timedOperation{
(1 to iterations).foreach { i =>
val str = stringModifyFunc(s"tag-${i.toString}")
System.out.println("str:" + str);
}
}
}
test("20k iterations in single threaded loop with queue (2)") {
timedOperation{
val (sourceQueue, sinkQueue, offers) = setup
val resultFuture: Future[Done] = offers.runForeach{ str =>
val itemFuture = for {
_ <- sourceQueue.offer(str)
item <- sinkQueue.pull()
} yield (stringModifyFunc(item.getOrElse("failed")) )
val item = Await.result(itemFuture, 10.second)
System.out.println("item:" + item);
}
val result = Await.result(resultFuture, 20.second)
System.out.println("result:" + result);
}
}
test("put items on queue then take them off (with async parallelism) - (3)") {
timedOperation{
val (sourceQueue, sinkQueue, offers) = setup
def enqueue(str: String) = sourceQueue.offer(str)
def dequeue = {
sinkQueue.pull().map{
maybeStr =>
val str = stringModifyFunc( maybeStr.getOrElse("failed2"))
println(s"dequeud value is $str")
}
}
val offerResults: Source[QueueOfferResult, NotUsed] =
offers.mapAsyncUnordered(10){ string => enqueue(string)}
val dequeueResults: Source[Unit, NotUsed] = offerResults.mapAsyncUnordered(10){ _ => dequeue }
val runAll: Future[Done] = dequeueResults.runForeach(u => u)
Await.result(runAll, 20.second)
}
}
}
I have a large query that seems to be a prime candidate for streaming results.
I would like to call a function that returns an object to which I can apply additional map transformations, and then ultimately convert the entire result into a list. This is because the conversions will result in a set of objects much smaller than the rows in the database, and there are many different transformations that must take place sequentially. Processing one result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1: String => Int, converter2: Int => Double) = {
  val sqlIterator = getSqlIterator()
  val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
  val mappedIterator2 = mappedIterator1.map(x => converter2(x))
  val retVal = mappedIterator2.toList
  retVal
}
def getSqlIterator() = {
  val selectedObjects = SQL("""SELECT * FROM table""").map { x =>
    val id = x[Long]("id")
    val tinyColumn = x[String]("tiny_column")
    val bigColumn = x[String]("big_column")
    NewObject(id, tinyColumn, bigColumn)
  }
  val transformed = UNKNOWN_FUNCTION(selectedObjects)
  transformed
}
Most of the documentation appears to provide a mechanism to apply a "reduce" function to the results, rather than a "map" function, but the mapped results will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?
The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
.map(convertStringToInt)
.map(convertIntToDouble)
.runWith(Sink.seq[Double])
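The memory saving comes from each element passing through both transformations before the next one is read, so only the small converted values accumulate. The same principle can be seen with a plain Scala Iterator, independent of Anorm or Akka. The sketch below is illustrative only: Record is a hypothetical stand-in for the question's NewObject, and convertAll is a made-up name.

```scala
object LazyPipelineSketch {
  // Hypothetical row type standing in for the question's NewObject.
  final case class Record(id: Long, bigColumn: String)

  def convertAll(rows: Iterator[Record],
                 converter1: String => Int,
                 converter2: Int => Double): List[Double] =
    rows
      .map(r => converter1(r.bigColumn)) // lazy: nothing is read yet
      .map(converter2)                   // still lazy
      .toList                            // pulls rows one at a time through both steps
}
```

Only the final List[Double] is retained by this code; whether the large rows can be garbage-collected early still depends on the source iterator not holding on to them.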
I have a custom Aggregator that does a count-min sketch aggregation. It works, but is slow (code below). I get similar slow performance if I use a custom UDAF based on the UserDefinedAggregateFunction class.
This is much faster if I use the Dataset mapPartitions API to aggregate within a partition and then reduce across partitions.
Question: the slowness of the UDAF and Aggregator APIs seems to be caused by the serialization/deserialization (encoding) that happens for each row. Are the UDAF and Aggregator APIs not meant to be used to aggregate into non-trivial data structures like the count-min sketch? Is the mapPartitions approach the best way to handle this?
Aggregator sample code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.util.sketch.CountMinSketch
object AggTest extends App {
val input = "2008.csv"
val conf = new SparkConf().setMaster("local[4]").setAppName("tester")
val sqlContext = SparkSession.builder().config(conf).getOrCreate().sqlContext
val df = sqlContext.read.format("csv").option("header", "true").load(input)
implicit val sketchEncoder = org.apache.spark.sql.Encoders.kryo[CountMinSketch]
case class SketchAgg(col: String) extends Aggregator[Row, CountMinSketch, CountMinSketch] {
def zero: CountMinSketch = CountMinSketch.create(500, 4, 2429)
def reduce(sketch: CountMinSketch, row: Row) = {
val a = row.getAs[Any](col)
sketch.add(a)
sketch
}
def merge(sketch1: CountMinSketch, sketch2: CountMinSketch) = {
sketch1.mergeInPlace(sketch2)
}
def finish(sketch: CountMinSketch) = sketch
def bufferEncoder: Encoder[CountMinSketch] = sketchEncoder
def outputEncoder: Encoder[CountMinSketch] = sketchEncoder
}
val sketch = df.agg(SketchAgg("ArrDelay")
.toColumn
.alias("sketch"))
.select("sketch")
.as[CountMinSketch]
.first()
}
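For comparison, the mapPartitions approach described above might look like the following sketch: one sketch is built per partition and the per-partition sketches are then merged, so the sketch is serialized once per partition rather than once per row. The names PartitionSketch and sketchColumn are made up for illustration, the create parameters are copied from the Aggregator above, and skipping null values is an assumption about how missing data should be handled.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.CountMinSketch

object PartitionSketch {
  // Builds one sketch per partition, then merges the per-partition sketches.
  def sketchColumn(df: DataFrame, col: String): CountMinSketch =
    df.select(col).rdd
      .mapPartitions { rows =>
        // Same parameters as the Aggregator's `zero`.
        val sketch = CountMinSketch.create(500, 4, 2429)
        // Assumption: null values are skipped rather than counted.
        rows.foreach(row => if (!row.isNullAt(0)) sketch.add(row.get(0)))
        Iterator.single(sketch)
      }
      .reduce((a, b) => a.mergeInPlace(b))
}
```

With the DataFrame from the question, this would be called as PartitionSketch.sketchColumn(df, "ArrDelay").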