I am new to Scala and have a general question about Scala's Future.
Say I have a list of elements, and for each element in the list I have to invoke a method that does some processing.
I can wrap each call in a Future and do the processing in parallel, but my question is how to control how many of those tasks run concurrently in the background.
For example, I want to cap the number of parallel tasks at 10. So at most, futures should be processing 10 elements of the list at once, and as soon as one of the spawned tasks completes, processing of one of the remaining elements should start, keeping the count at the maximum until the list is exhausted.
I searched on Google but could not find an answer. In Unix the same thing can be done by running processes in the background and manually checking the count with the ps command, but I am not very familiar with Scala. Please help me with this.
Thanks in advance.
Let us create two thread pools of different sizes:
val fiveThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(5))
val tenThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
We can control which thread pool a future runs on by passing the executor as an argument to the future, like so
Future(42)(tenThreadsEc)
This is equivalent to
Future.apply(body = 42)(executor = tenThreadsEc)
which corresponds to the signature of Future.apply
def apply[T](body: => T)(implicit executor: ExecutionContext): Future[T] =
Note how the executor parameter is declared as implicit. This means we could provide it implicitly like so
implicit val tenThreadsEc = ...
Future(42) // executor = tenThreadsEc argument passed in magically
Now, as per Luis' suggestion, consider the simplified signature of Future.traverse
def traverse[A, B, M[X] <: IterableOnce[X]](in: M[A])(fn: A => Future[B])(implicit ..., executor: ExecutionContext): Future[M[B]]
Let us simplify it further by fixing the M type constructor parameter to, say, M = List:
def traverse[A, B]
(in: List[A]) // list of things to process in parallel
(fn: A => Future[B]) // function to process an element asynchronously
(implicit executor: ExecutionContext) // thread pool to use for parallel processing
: Future[List[B]] // returned result is a future of list of things instead of list of future things
Let's pass in the arguments
val tenThreadsEc = ...
val myList: List[Int] = List(11, 42, -1)
def myFun(x: Int)(implicit executor: ExecutionContext): Future[Int] = Future(x + 1)(executor)
Future.traverse[Int, Int, List](
in = myList)(
fn = myFun(_)(executor = tenThreadsEc))(
executor = tenThreadsEc,
bf = implicitly // ignore this
)
Relying on implicit resolution and type inference, we have simply
implicit val tenThreadsEc = ...
Future.traverse(myList)(myFun)
Putting it all together, here is a working example
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
object FuturesExample extends App {
val fiveThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(5))
val tenThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
val myList: List[Int] = List(11, 42, -1)
def myFun(x: Int)(implicit executor: ExecutionContext): Future[Int] = Future(x + 1)(executor)
Future(body = 42)(executor = fiveThreadsEc)
.andThen(v => println(v))(executor = fiveThreadsEc)
Future.traverse[Int, Int, List](
in = myList)(
fn = myFun(_)(executor = tenThreadsEc))(
executor = tenThreadsEc,
bf = implicitly
).andThen(v => println(v))(executor = tenThreadsEc)
// With an implicit execution context in scope, the call sites simplify to...
implicit val ec = tenThreadsEc
Future(42)
.andThen(v => println(v))
Future.traverse(myList)(myFun)
.andThen(v => println(v))
}
which outputs
Success(42)
Success(List(12, 43, 0))
Success(42)
Success(List(12, 43, 0))
Alternatively, Scala provides a default execution context called
scala.concurrent.ExecutionContext.Implicits.global
and we can control its parallelism with system properties
scala.concurrent.context.minThreads
scala.concurrent.context.numThreads
scala.concurrent.context.maxThreads
scala.concurrent.context.maxExtraThreads
For example, create the following ConfiguringGlobalExecutorParallelism.scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object ConfiguringGlobalExecutorParallelism extends App {
println(scala.concurrent.ExecutionContext.Implicits.global.toString)
Future.traverse(List(11,42,-1))(x => Future(x + 1))
.andThen(v => println(v))
}
and run it with
scala -Dscala.concurrent.context.numThreads=10 -Dscala.concurrent.context.maxThreads=10 ConfiguringGlobalExecutorParallelism.scala
which should output
scala.concurrent.impl.ExecutionContextImpl$$anon$3#cb191ca[Running, parallelism = 10, size = 0, active = 0, running = 0, steals = 0, tasks = 0, submissions = 0]
Success(List(12, 43, 0))
Note how parallelism = 10.
Another option is to use parallel collections
libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "0.2.0"
and configure parallelism via tasksupport, for example
val myParVector: ParVector[Int] = ParVector(11, 42, -1)
myParVector.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
myParVector.map(x => x + 1)
Note that parallel collections are a separate facility from Futures
parallel collection design in Scala has no notion of an
ExecutionContext, that is strictly a property of Future. The parallel
collection library has a notion of a TaskSupport which is responsible
for scheduling inside the parallel collection
so we can map over the collection simply with x => x + 1 instead of x => Future(x + 1); there is no need for Future.traverse, a regular map is sufficient.
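Putting that together, here is a minimal self-contained sketch of the parallel-collections approach (the object name is mine), again capping parallelism at 10 worker threads:
import java.util.concurrent.ForkJoinPool

import scala.collection.parallel.ForkJoinTaskSupport
import scala.collection.parallel.immutable.ParVector

object ParCollectionsExample extends App {
  val myParVector: ParVector[Int] = ParVector(11, 42, -1)

  // limit this collection to 10 worker threads
  myParVector.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))

  println(myParVector.map(x => x + 1)) // ParVector(12, 43, 0)
}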
Related
I am new to Scala and I have some questions about how it works.
I want to do the following: given a list of values, I want to construct an imitation of a dictionary in parallel, something like this: (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)). I know that if we deal with parallelized collections we should use accumulators. So here is my attempt:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ListBuffer
class DictAccumulatorV2 extends AccumulatorV2[Int, ListBuffer[(Int, Int)]] {
private var dict:ListBuffer[(Int, Int)]= new ListBuffer[(Int, Int)]
def reset(): Unit = {
dict.clear()
}
def add(v: Int): Unit = {
dict.append((v, v))
}
def value():ListBuffer[(Int, Int)] = {
return dict
}
def isZero(): Boolean = {
return dict.isEmpty
}
def copy() : AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
// I do not understand how to code it correctly
return new DictAccumulatorV2
}
def merge(other:AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
// I do not understand how to code it correctly without reinitializing dict from val to var
dict = dict ++ other.value
}
}
object FirstSparkApplication {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local")
val sc = new SparkContext(conf)
val accum = new DictAccumulatorV2()
sc.register(accum, "mydictacc")
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
var res = distData.map(x => accum.add(x))
res.count()
println(accum)
}
}
So I wonder whether I am doing this right or whether there are any mistakes.
In general I also have questions about how sc.parallelize works. Does it actually parallelize the job on my machine, or is it just a fictional line of code? What should I put instead of "local" in setMaster? How can I see which nodes the task is being performed on? Is the task performed on all of the nodes at the same time, or is there some sequence?
(1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4) )
You can do this in Scala by doing
val list = List(1,2,3,4)
val dict = list.map(i => (i,i))
Spark accumulators are used as a means of communication from the Spark executors to the driver.
If you want to do the above in parallel, you would construct an RDD out of this list and apply a map transformation to it, as shown above.
In spark shell it would look like
val list = List(1,2,3,4)
val listRDD = sc.parallelize(list)
val dictRDD = listRDD.map(i => (i,i))
how sc.parallelize works
It creates a distributed dataset (an RDD, in Spark terms) from the collection that you pass in to the function. More information.
It does parallelize your job.
If you are submitting your Spark job to a cluster, you should see a YARN application ID or URL after running the spark-submit command. You can visit the YARN application URL and see how many executors are processing the distributed dataset and in what sequence they run.
What should I put instead of "local" in setMaster
From the Spark documentation -
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
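For example, a minimal sketch (reusing the app name from the question; the cluster URL is only illustrative) of how the master setting controls where the job runs:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyFirstApp")
  .setMaster("local[4]") // run locally with 4 cores; e.g. "spark://master:7077" for a standalone cluster

val sc = new SparkContext(conf)
val dictRDD = sc.parallelize(List(1, 2, 3, 4)).map(i => (i, i))
println(dictRDD.collect().toList) // List((1,1), (2,2), (3,3), (4,4))
sc.stop()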
Is it possible to run multiple queries in parallel, using Doobie?
I have the following (pseudo)queries:
def prepareForQuery(input: String): ConnectionIO[Unit] = ???
val gettAllResults: ConnectionIO[List[(String, BigDecimal)]] = ???
def program(input : String) : ConnectionIO[List[(String, BigDecimal)]] = for{
_ <- prepareForQuery(input)
r <- gettAllResults
} yield r
What I tried is the following:
import doobie._
import doobie.implicits._
import cats.implicits._
val xa = Transactor.fromDataSource[IO](myDataSource)
val result = (program(i1),program(i2)).parMapN{case (a,b) => a ++ b}
val rs = result.transact(xa).unsafeRunSync
However, no NonEmptyParallel instance is found for ConnectionIO.
Error:(107, 54) could not find implicit value for parameter p: cats.NonEmptyParallel[doobie.ConnectionIO,F]
  val result = (program(i1),program(i2)).parMapN{case (a ,b) => a ++ b}
Am I missing something obvious or trying something that cannot be done?
Thanks
You cannot run queries in the ConnectionIO monad in parallel. But as soon as you turn them into your actual runtime monad (as long as it has a Parallel instance), you can.
For example, with the cats-effect IO runtime monad:
def prepareForQuery(input: String): ConnectionIO[Unit] = ???
val gettAllResults: ConnectionIO[List[(String, BigDecimal)]] = ???
def program(input : String) : ConnectionIO[List[(String, BigDecimal)]] = for{
_ <- prepareForQuery(input)
r <- gettAllResults
} yield r
Turn your ConnectionIO into an IO
val program1IO: IO[List[(String, BigDecimal)]] = program(i1).transact(xa)
val program2IO: IO[List[(String, BigDecimal)]] = program(i2).transact(xa)
You now have a monad which can do things in parallel.
val result: IO[List[(String, BigDecimal)]] =
(program1IO, program2IO).parMapN{case (a,b) => a ++ b}
To understand why ConnectionIO doesn't allow you to do things in parallel, I'll just quote tpolecat:
You can't run ConnectionIO in parallel. It's a language describing the use of a connection which is a linear sequence of operations.
Using parMapN in IO, yes, you can run two things at the same time because they're running on different connections.
There is no parMapN with ConnectionIO because it does not (and cannot) have a Parallel instance.
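Putting the pieces together, here is a minimal sketch, assuming a cats-effect 2 style IO (where the Parallel instance for IO needs an implicit ContextShift) and reusing the transactor xa from the question:
import cats.effect.{ContextShift, IO}
import cats.implicits._
import doobie.implicits._

import scala.concurrent.ExecutionContext

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

// each program runs on its own connection, so the two can proceed in parallel
val combined: IO[List[(String, BigDecimal)]] =
  (program(i1).transact(xa), program(i2).transact(xa))
    .parMapN { case (a, b) => a ++ b }

val rs: List[(String, BigDecimal)] = combined.unsafeRunSync()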
I need to do something very similar to this: https://github.com/typesafehub/activator-akka-stream-scala/blob/master/src/main/scala/sample/stream/GroupLogFile.scala
My problem is that I have an unknown number of groups, and if the parallelism of the mapAsync is less than the number of groups, I get an error in the last sink:
Tearing down SynchronousFileSink(/Users/sam/dev/projects/akka-streams/target/log-ERROR.txt) due to upstream error (akka.stream.impl.StreamSubscriptionTimeoutSupport$$anon$2)
I tried to put a buffer in the middle, as suggested in the Akka Streams cookbook http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html
groupBy {
case LoglevelPattern(level) => level
case other => "OTHER"
}.buffer(1000, OverflowStrategy.backpressure).
// write lines of each group to a separate file
mapAsync(parallelism = 2) {....
but with the same result
Expanding on jrudolph's comment, which is completely correct...
You do not need a mapAsync in this instance. As a basic example, suppose you have a source of tuples
import akka.stream.scaladsl.{Source, Sink}
def data() = List(("foo", 1),
("foo", 2),
("bar", 1),
("foo", 3),
("bar", 2))
val originalSource = Source(data)
You can then perform a groupBy to create a Source of Sources
def getID(tuple : (String, Int)) = tuple._1
//a Source of (String, Source[(String, Int),_])
val groupedSource = originalSource groupBy getID
Each one of the grouped Sources can be processed in parallel with just a map, no need for anything fancy. Here is an example of each grouping being summed in an independent stream:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
implicit val actorSystem = ActorSystem()
implicit val mat = ActorMaterializer()
import actorSystem.dispatcher
def getValues(tuple : (String, Int)) = tuple._2
//does not have to be a def, we can re-use the same sink over-and-over
val sumSink = Sink.fold[Int,Int](0)(_ + _)
//a Source of (String, Future[Int])
val sumSource =
groupedSource map { case (id, src) =>
id -> {src map getValues runWith sumSink} //calculate sum in independent stream
}
Now all of the "foo" numbers are being summed in parallel with all of the "bar" numbers.
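To actually see the results, here is a minimal sketch (assuming the definitions above) that drains sumSource and prints each group's total once its Future completes:
import scala.concurrent.Future

sumSource.runWith(Sink.foreach[(String, Future[Int])] { case (id, futureSum) =>
  futureSum foreach (sum => println(s"group $id summed to $sum"))
})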
mapAsync is used when you have an encapsulated function that returns a Future[T] and you're trying to emit a T instead, which is not the case in your question. Further, mapAsync involves waiting for results, which is not reactive...
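For contrast, a minimal sketch of where mapAsync does fit, using a hypothetical lookup function that returns a Future:
import scala.concurrent.Future

def lookup(word: String): Future[Int] = Future.successful(word.length) // hypothetical async call

Source(List("foo", "bar"))
  .mapAsync(parallelism = 2)(lookup)
  .runWith(Sink.foreach[Int](len => println(len)))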
I've set up the Spark core project from https://github.com/apache/spark.git. I've invoked one of the test classes, CacheManagerSuite, and it passes.
How do I run some Spark transformations/actions against this source build? What class/object do I need to invoke within the Spark project source in order to run the following?
scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))
scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15
The Spark core project contains unit tests that make it clearer how the parallelize and reduce methods are called and implemented.
In org.apache.spark.util.ClosureCleanerSuite there is a call to TestClassWithoutDefaultConstructor.
org.apache.spark.util.TestClassWithoutDefaultConstructor calls the parallelize and reduce methods of Spark:
class TestClassWithoutDefaultConstructor(x: Int) extends Serializable {
def getX = x
def run(): Int = {
var nonSer = new NonSerializable
withSpark(new SparkContext("local", "test")) { sc =>
val nums = sc.parallelize(Array(1, 2, 3, 4))
nums.map(_ + getX).reduce(_ + _)
}
}
}
Similarly, org.apache.spark.rdd.PairRDDFunctionsSuite contains method calls to groupByKey.
The above tests compile and run on a local machine.
To experiment with Spark as in your quoted example, start bin/spark-shell.
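Alternatively, here is a minimal sketch (the object name is mine) of running the quoted transformations with a local SparkContext outside of spark-shell:
import org.apache.spark.SparkContext

object FlatMapExample extends App {
  val sc = new SparkContext("local", "test")

  val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
  println(x.collect().toList)                 // List(List(a), List(b), List(c, d))
  println(x.flatMap(y => y).collect().toList) // List(a, b, c, d)

  sc.stop()
}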
I need to process multiple data values in parallel ("SIMD"). I can use the java.util.concurrent APIs (Executors.newFixedThreadPool()) to process several values in parallel using Future instances:
import java.util.concurrent.{Executors, Callable}
import scala.collection.JavaConverters._
class ExecutorsTest {
private class Process(value: Int)
extends Callable[Int] {
def call(): Int = {
// Do some time-consuming task
value
}
}
val executorService = {
val threads = Runtime.getRuntime.availableProcessors
Executors.newFixedThreadPool(threads)
}
val processes = for (process <- 1 to 1000) yield new Process(process)
val futures = executorService.invokeAll(processes.asJava)
// Wait for futures
}
How do I do the same thing using Actors? I do not believe that I want to "feed" all of the processes to a single actor because the actor will then execute them sequentially.
Do I need to create multiple "processor" actors with a "dispatcher" actor that sends an equal number of processes to each "processor" actor?
If you just want fire-and-forget processing, why not use Scala futures?
import scala.actors.Futures._
def example = {
val answers = (1 to 4).map(x => future {
Thread.sleep(x*1000)
println("Slept for "+x)
x
})
val t0 = System.nanoTime
awaitAll(1000000,answers: _*) // Number is timeout in ms
val t1 = System.nanoTime
printf("%.3f seconds elapsed\n",(t1-t0)*1e-9)
answers.map(_()).sum
}
scala> example
Slept for 1
Slept for 2
Slept for 3
Slept for 4
4.000 seconds elapsed
res1: Int = 10
Basically, all you do is put the code you want inside a future { } block, and it will immediately return a future; apply it to get the answer (it will block until done), or use awaitAll with a timeout to wait until everyone is done.
Update: As of 2.11, the way to do this is with scala.concurrent.Future. A translation of the above code is:
import scala.concurrent._
import duration._
import ExecutionContext.Implicits.global
def example = {
val answers = Future.sequence(
(1 to 4).map(x => Future {
Thread.sleep(x*1000)
println("Slept for "+x)
x
})
)
val t0 = System.nanoTime
val completed = Await.result(answers, Duration(1000, SECONDS))
val t1 = System.nanoTime
printf("%.3f seconds elapsed\n",(t1-t0)*1e-9)
completed.sum
}
If you can use Akka, take a look at the ActorPool support: http://doc.akka.io/routing-scala
It lets you specify parameters about how many actors you want running in parallel and then dispatches work to those actors.
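For illustration, a minimal sketch of the same idea using the newer Akka routing API (RoundRobinPool, which replaced the old ActorPool linked above); the worker and the messages it handles are placeholders:
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// placeholder worker: processes one value at a time
class Worker extends Actor {
  def receive = {
    case value: Int =>
      // Do some time-consuming task
      println(s"processed $value on ${self.path.name}")
  }
}

object RouterExample extends App {
  val system = ActorSystem("pool")

  // a round-robin pool of 10 Worker actors processes messages concurrently
  val router = system.actorOf(RoundRobinPool(10).props(Props[Worker]), "workers")

  (1 to 1000).foreach(router ! _)
}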