fs2.Stream hangs on taking twice - scala

Problem:
I want to repeatedly take batches from an fs2.Stream provided by a third-party library, abstract clients away from the fs2.Stream itself, and simply hand them F[List[Int]] batches as soon as they are ready.
Attempts:
I tried to use fs2.Stream::take and ran some examples.
I.
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

val r = for {
  queue <- fs2.concurrent.Queue.unbounded[IO, Int]
  stream = queue.dequeue
  _ <- fs2.Stream.range(0, 1000).covaryAll[IO, Int].evalTap(queue.enqueue1).compile.drain
  _ <- stream.take(10).compile.toList.flatTap(lst => IO(println(lst))).iterateWhile(_.nonEmpty)
} yield ()

r.unsafeRunSync()
It prints the very first batch List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and then hangs. I expected all the batches from 0 to 1000 to be printed.
Keeping things a bit simpler, here is
II.
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

val r = for {
  queue <- fs2.concurrent.Queue.unbounded[IO, Int]
  stream = queue.dequeue
  _ <- fs2.Stream.range(0, 1000).covaryAll[IO, Int].evalTap(queue.enqueue1).compile.drain
  _ <- stream.take(10).compile.toList.flatTap(lst => IO(println(lst)))
  _ <- stream.take(20).compile.toList.flatTap(lst => IO(println(lst)))
} yield ()

r.unsafeRunSync()
The behavior is exactly the same as in I: it prints List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and then hangs.
Question:
Given an fs2.Stream[IO, Int], how can I provide an effect IO[List[Int]] that, when evaluated, steps through consecutive batches of the stream?

Well, you cannot have an IO[List[X]] that represents multiple batches; that IO would be a single batch.
The best you can do is something like this:
def processByBatches(process: List[Int] => IO[Unit]): IO[Unit]
That is, your users will give you an operation to execute for each batch, and you will give them an IO that blocks the current fiber until the whole stream has been consumed using that function.
And the simplest way to implement such a function would be:
def processByBatches(process: List[Int] => IO[Unit]): IO[Unit] =
  getStreamFromThirdParty
    .chunkN(n = ChunkSize)
    .evalMap(chunk => process(chunk.toList))
    .compile
    .drain
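For illustration, here is a minimal, self-contained sketch of how a caller would then only ever see plain List[Int] batches. It repeats the definition above so it compiles on its own; getStreamFromThirdParty and ChunkSize are made-up stand-ins for whatever the real library and configuration provide (fs2 2.x / cats-effect 2.x assumed):

import cats.effect.IO
import fs2.Stream

val ChunkSize = 10                              // hypothetical batch size
val getStreamFromThirdParty: Stream[IO, Int] =  // stand-in for the third-party stream
  Stream.range(0, 100).covary[IO]

def processByBatches(process: List[Int] => IO[Unit]): IO[Unit] =
  getStreamFromThirdParty
    .chunkN(n = ChunkSize)
    .evalMap(chunk => process(chunk.toList))
    .compile
    .drain

// the client never touches fs2 directly:
val printAllBatches: IO[Unit] =
  processByBatches(batch => IO(println(batch)))

Running printAllBatches evaluates the whole third-party stream, handing the callback one List[Int] per batch.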

Related

Schedule computation concurrently for all elements of the fs2.Stream

I have an fs2.Stream of elements (probably infinite) and I want to schedule some computation for all elements of the stream concurrently with each other. Here is what I tried:
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)
val stream = for {
  id <- fs2.Stream.emits(List(1, 2)).covary[IO]
  _ <- fs2.Stream.awakeEvery[IO](1.second)
  _ <- fs2.Stream.eval(IO(println(id)))
} yield ()

stream.compile.drain.unsafeRunSync()
The program output looks like
1
1
1
etc...
which is not what's expected. I'd like to interleave the scheduled computation for all of the elements of the original stream, but not wait until the first stream terminates (which never happens due to the infinite scheduling).
val str = for {
  id <- Stream.emits(List(1, 5, 7)).covary[IO]
  res = timer.sleep(id.second) >> IO(println(id))
} yield res

val stream = str.parEvalMapUnordered(5)(identity)

stream.compile.drain.unsafeRunSync()
or
val stream = Stream.emits(List(1, 5, 7))
  .map { id =>
    Stream.eval(timer.sleep(id.second) >> IO(println(id)))
  }
  .parJoinUnbounded

stream.compile.drain.unsafeRunSync()
According to the hints given by #KrzysztofAtłasik and #LuisMiguelMejíaSuárez, here is the solution I just came up with:
val originalStream = fs2.Stream.emits(List(1, 2))

val scheduledComputation = originalStream.covary[IO]
  .map { id =>
    fs2.Stream.awakeEvery[IO](1.second).evalMap(_ => IO.delay(println(id)))
  }
  .fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream))
  .flatten
The solution that #KrzysztofAtłasik proposed in the comments, interleaving id <- fs2.Stream.emits(List(1, 2)).covary[IO] with _ <- fs2.Stream.awakeEvery[IO](1.second), also works, but it does not allow scheduling each element in its own way.
To schedule each element individually, every id seconds, it is possible to do the following:
val scheduleEachElementIndividually = originalStream.covary[IO]
  .map { id =>
    fs2.Stream.awakeEvery[IO](id.second).evalMap(_ => IO.delay(println(id)))
  }
  .fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream))
  .flatten
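For what it's worth, the fold-with-merge can likely be replaced by parJoinUnbounded (the same combinator used in the earlier answer), which joins all the inner streams concurrently and is what the manual fold/merge is doing by hand. A sketch assuming the same originalStream and the implicits above (ContextShift, Timer, and the duration syntax):

val scheduledViaParJoin: fs2.Stream[IO, Unit] =
  originalStream
    .covary[IO]
    .map(id => fs2.Stream.awakeEvery[IO](id.second).evalMap(_ => IO.delay(println(id))))
    .parJoinUnbounded // requires Concurrent[IO], derived from the implicit ContextShift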

How to control the future concurrent in Scala? [duplicate]

This question already has answers here:
How to configure a fine tuned thread pool for futures?
I am a newbie to Scala. I have a general query on the Future concept in Scala.
Say I have a list of elements, and for each element present in the list I have to invoke a method which does some processing.
We can use Future and do our processing in parallel, but my question is: how can we control the number of concurrent processing tasks running in parallel/in the background?
For example, I should maintain a limit of 10 parallel running tasks. So at most my code should spawn processing for 10 elements of the list and wait for any of the spawned tasks to complete. Once a slot is free it should spawn processing for the remaining elements until it reaches the maximum again.
I searched on Google but was not able to find anything. In Unix the same can be done by running processes in the background and manually checking the count with the ps command. Since I am not very familiar with Scala, please help me with this.
Thanks in advance.
Let us create two thread pools of different sizes:
val fiveThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(5))
val tenThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
We can control which thread pool the future runs on by passing it as an argument to the future, like so
Future(42)(tenThreadsEc)
This is equivalent to
Future.apply(body = 42)(executor = tenThreadsEc)
which corresponds to the signature of Future.apply
def apply[T](body: => T)(implicit executor: ExecutionContext): Future[T] =
Note how the executor parameter is declared as implicit. This means we could provide it implicitly like so
implicit val tenThreadsEc = ...
Future(42) // executor = tenThreadsEc argument passed in magically
Now, as per Luis' suggestion, consider the simplified signature of Future.traverse
def traverse[A, B, M[X] <: IterableOnce[X]](in: M[A])(fn: A => Future[B])(implicit ..., executor: ExecutionContext): Future[M[B]]
Let us simplify it further by fixing the M type constructor parameter to, say, M = List
def traverse[A, B]
(in: List[A]) // list of things to process in parallel
(fn: A => Future[B]) // function to process an element asynchronously
(implicit executor: ExecutionContext) // thread pool to use for parallel processing
: Future[List[B]] // returned result is a future of list of things instead of list of future things
Let's pass in the arguments
val tenThreadsEc = ...
val myList: List[Int] = List(11, 42, -1)
def myFun(x: Int)(implicit executor: ExecutionContext): Future[Int] = Future(x + 1)(executor)

Future.traverse[Int, Int, List](
  in = myList)(
  fn = myFun(_)(executor = tenThreadsEc))(
  executor = tenThreadsEc,
  bf = implicitly // ignore this
)
Relying on implicit resolution and type inference, we have simply
implicit val tenThreadsEc = ...
Future.traverse(myList)(myFun)
Putting it all together, here is a working example
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object FuturesExample extends App {
  val fiveThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(5))
  val tenThreadsEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

  val myList: List[Int] = List(11, 42, -1)

  def myFun(x: Int)(implicit executor: ExecutionContext): Future[Int] = Future(x + 1)(executor)

  Future(body = 42)(executor = fiveThreadsEc)
    .andThen(v => println(v))(executor = fiveThreadsEc)

  Future.traverse[Int, Int, List](
    in = myList)(
    fn = myFun(_)(executor = tenThreadsEc))(
    executor = tenThreadsEc,
    bf = implicitly
  ).andThen(v => println(v))(executor = tenThreadsEc)

  // Using an implicit execution context, the call-site simplifies to...
  implicit val ec = tenThreadsEc

  Future(42)
    .andThen(v => println(v))

  Future.traverse(myList)(myFun)
    .andThen(v => println(v))
}
which outputs
Success(42)
Success(List(12, 43, 0))
Success(42)
Success(List(12, 43, 0))
Alternatively, Scala provides a default execution context called
scala.concurrent.ExecutionContext.Implicits.global
and we can control its parallelism with system properties
scala.concurrent.context.minThreads
scala.concurrent.context.numThreads
scala.concurrent.context.maxThreads
scala.concurrent.context.maxExtraThreads
For example, create the following ConfiguringGlobalExecutorParallelism.scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object ConfiguringGlobalExecutorParallelism extends App {
  println(scala.concurrent.ExecutionContext.Implicits.global.toString)

  Future.traverse(List(11, 42, -1))(x => Future(x + 1))
    .andThen(v => println(v))
}
and run it with
scala -Dscala.concurrent.context.numThreads=10 -Dscala.concurrent.context.maxThreads=10 ConfiguringGlobalExecutorParallelism.scala
which should output
scala.concurrent.impl.ExecutionContextImpl$$anon$3#cb191ca[Running, parallelism = 10, size = 0, active = 0, running = 0, steals = 0, tasks = 0, submissions = 0]
Success(List(12, 43, 0))
Note how parallelism = 10.
Another option is to use parallel collections
libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "0.2.0"
and configure parallelism via tasksupport, for example
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
import scala.collection.parallel.immutable.ParVector

val myParVector: ParVector[Int] = ParVector(11, 42, -1)
myParVector.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
myParVector.map(x => x + 1)
Note that parallel collections are a separate facility from Futures
parallel collection design in Scala has no notion of an
ExecutionContext, that is strictly a property of Future. The parallel
collection library has a notion of a TaskSupport which is responsible
for scheduling inside the parallel collection
so we could map over the collection simply with x => x + 1 instead of x => Future(x + 1); there was no need to use Future.traverse, a regular map was sufficient.

Count number of elements in Akka Streams

I'm trying to transform a Source of Scala entities into a Source of ByteString via Alpakka's CsvFormatting and count the number of elements in the initial stream. Could you suggest the best way to count the initialSource elements while keeping the result as a ByteString Source:
val initialSource: Source[SomeEntity, NotUsed] = Source.fromPublisher(publisher)

val csvSource: Source[ByteString, NotUsed] = initialSource
  .map(e => List(e.firstName, e.lastName, e.city))
  .via(CsvFormatting.format())
To count the elements in a stream, one must run the stream. One approach is to broadcast the stream elements to two sinks: one sink is the result of the main processing, the other sink simply counts the number of elements. Here is a simple example, which uses a graph to obtain the materialized values of both sinks:
import akka.NotUsed
import akka.stream.ClosedShape
import akka.stream.alpakka.csv.scaladsl.CsvFormatting
import akka.stream.scaladsl.{Broadcast, GraphDSL, RunnableGraph, Sink, Source}
import akka.util.ByteString

val sink1 = Sink.foreach(println)
val sink2 = Sink.fold[Int, ByteString](0)((acc, _) => acc + 1)

val g = RunnableGraph.fromGraph(GraphDSL.create(sink1, sink2)((_, _)) { implicit builder => (s1, s2) =>
  import GraphDSL.Implicits._

  val broadcast = builder.add(Broadcast[ByteString](2))

  val source: Source[ByteString, NotUsed] =
    Source(1 to 10)
      .map(i => List(i.toString))
      .via(CsvFormatting.format())

  source ~> broadcast.in
  broadcast.out(0) ~> s1
  broadcast.out(1) ~> s2

  ClosedShape
}) // RunnableGraph[(Future[Done], Future[Int])]
val (fut1, fut2) = g.run()

fut2 onComplete {
  case Success(count) => println(s"Number of elements: $count")
  case Failure(_) =>
}
In the above example, the first sink just prints the stream elements and has a materialized value of type Future[Done]. The second sink does a fold operation to count the stream elements and has a materialized value of type Future[Int]. The following is printed:
ByteString(49, 13, 10)
ByteString(50, 13, 10)
ByteString(51, 13, 10)
ByteString(52, 13, 10)
ByteString(53, 13, 10)
ByteString(54, 13, 10)
ByteString(55, 13, 10)
ByteString(56, 13, 10)
ByteString(57, 13, 10)
ByteString(49, 48, 13, 10)
Number of elements: 10
Another option for sending stream elements to two different sinks, while retaining their respective materialized values, is to use alsoToMat:
val sink1 = Sink.foreach(println)
val sink2 = Sink.fold[Int, ByteString](0)((acc, _) => acc + 1)

val (fut1, fut2) = Source(1 to 10)
  .map(i => List(i.toString))
  .via(CsvFormatting.format())
  .alsoToMat(sink1)(Keep.right)
  .toMat(sink2)(Keep.both)
  .run() // (Future[Done], Future[Int])

fut2 onComplete {
  case Success(count) => println(s"Number of elements: $count")
  case Failure(_) =>
}
This produces the same result as the graph example described earlier.
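Applied to the csvSource from the question, the same alsoToMat idea keeps the value a Source[ByteString, _] while exposing the count as the materialized value. A sketch assuming the csvSource defined above:

val countingCsvSource: Source[ByteString, Future[Int]] =
  csvSource.alsoToMat(Sink.fold[Int, ByteString](0)((acc, _) => acc + 1))(Keep.right)

// whoever eventually runs the source obtains the count as (part of) the materialized value:
// val (countFut, done) = countingCsvSource.toMat(Sink.ignore)(Keep.both).run()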

How to count the number of occurrences of an element with scala/spark?

I had a file that contained a list of elements like this
00|905000|20160125204123|79644809999||HGMTC|1||22|7905000|56321647569|||34110|I||||||250995210056537|354805064211510||56191|||38704||A|||11|V|81079681404134|5||||SE|||G|144|||||||||||||||Y|b00534589.huawei_anadyr.20151231184912||1|||||79681404134|0|||+##+1{79098509982}2{2}3{2}5{79644809999}6{0000002A7A5AC635}7{79681404134}|20160125|
Through a series of steps, I managed to convert it to a list of elements like this
(902996760100000,CompactBuffer(6, 5, 2, 2, 8, 6, 5, 3))
Where 905000 and 902996760100000 are keys and 6, 5, 2, 2, 8, 6, 5, 3 are values. Values can be numbers from 1 to 8. Are there any ways to count the number of occurrences of these values using Spark, so the result looks like this?
(902996760100000, 0_1, 2_2, 1_3, 0_4, 2_5, 2_6, 0_7, 1_8)
I could do it with if/else blocks and stuff, but that won't be pretty, so I wondered if there are any instruments I could use in Scala/Spark.
This is my code.
class ScalaJob(sc: SparkContext) {
  def run(cdrPath: String): RDD[(String, Iterable[String])] = {
    //pass the file
    val fileCdr = sc.textFile(cdrPath)
    //find values in every raw cdr
    val valuesCdr = fileCdr.map { dataRaw =>
      val p = dataRaw.split("[|]", -1)
      (p(1), ScalaJob.processType(ScalaJob.processTime(p(2)) + "_" + p(32)))
    }
    val x = valuesCdr.groupByKey()
    return x
  }
Any advice on optimizing it would be appreciated. I'm really new to scala/spark.
First, Scala is a type-safe language and so is Spark's RDD API - so it's highly recommended to use the type system instead of going around it by "encoding" everything into Strings.
So I'll suggest a solution that creates an RDD[(String, Seq[(Int, Int)])] (with the second item in each tuple being a sequence of (ID, count) pairs) and not an RDD[(String, Iterable[String])], which seems less useful.
Here's a simple function that counts the occurrences of 1 to 8 in a given Iterable[Int]:
def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
(1 to 8).map(i => (i, l.count(_ == i)))
}
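For example, applying it directly to the values from the question's CompactBuffer reproduces the counts the question asks for (0_1, 2_2, 1_3, 0_4, 2_5, 2_6, 0_7, 1_8):

countValues(List(6, 5, 2, 2, 8, 6, 5, 3))
// Vector((1,0), (2,2), (3,1), (4,0), (5,2), (6,2), (7,0), (8,1))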
You can use mapValues with this function (place the function in the object for serializability, like you did with the rest) on an RDD[(String, Iterable[Int])] to get the result:
valuesCdr.groupByKey().mapValues(ScalaJob.countValues)
The entire solution can then be simplified a bit:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

class ScalaJob(sc: SparkContext) {
  import ScalaJob._

  def run(cdrPath: String): RDD[(String, Seq[(Int, Int)])] = {
    val valuesCdr = sc.textFile(cdrPath)
      .map(_.split("\\|"))
      .map(p => (p(1), processType(processTime(p(2)), p(32))))

    valuesCdr.groupByKey().mapValues(countValues)
  }
}

object ScalaJob {
  val dayParts = Map((6 to 11) -> 1, (12 to 18) -> 2, (19 to 23) -> 3, (0 to 5) -> 4)

  def processTime(s: String): Int = {
    val hour = DateTime.parse(s, DateTimeFormat.forPattern("yyyyMMddHHmmss")).getHourOfDay
    dayParts.filterKeys(_.contains(hour)).values.head
  }

  def processType(dayPart: Int, s: String): Int = s match {
    case "S" => 2 * dayPart - 1
    case "V" => 2 * dayPart
  }

  def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
    (1 to 8).map(i => (i, l.count(_ == i)))
  }
}
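For completeness, a hedged usage sketch of the job in local mode; the app name and input path are made up for the example:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cdr-counts").setMaster("local[*]"))

new ScalaJob(sc).run("/path/to/cdrs.txt") // hypothetical input path
  .take(5)
  .foreach(println) // e.g. (902996760100000,Vector((1,0), (2,2), (3,1), (4,0), (5,2), (6,2), (7,0), (8,1)))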

How to consume grouped sub streams with mapAsync in akka streams

I need to do something really similar to this https://github.com/typesafehub/activator-akka-stream-scala/blob/master/src/main/scala/sample/stream/GroupLogFile.scala
My problem is that I have an unknown number of groups, and if the parallelism of the mapAsync is less than the number of groups, I get an error in the last sink:
Tearing down SynchronousFileSink(/Users/sam/dev/projects/akka-streams/target/log-ERROR.txt) due to upstream error (akka.stream.impl.StreamSubscriptionTimeoutSupport$$anon$2)
I tried to put a buffer in the middle, as suggested in the Akka Streams cookbook (http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html)
groupBy {
  case LoglevelPattern(level) => level
  case other => "OTHER"
}.buffer(1000, OverflowStrategy.backpressure).
  // write lines of each group to a separate file
  mapAsync(parallelism = 2) {....
but with the same result
Expanding on jrudolph's comment, which is completely correct...
You do not need a mapAsync in this instance. As a basic example, suppose you have a source of tuples
import akka.stream.scaladsl.{Source, Sink}

def data() = List(("foo", 1),
                  ("foo", 2),
                  ("bar", 1),
                  ("foo", 3),
                  ("bar", 2))

val originalSource = Source(data())
You can then perform a groupBy to create a Source of Sources
def getID(tuple : (String, Int)) = tuple._1
//a Source of (String, Source[(String, Int),_])
val groupedSource = originalSource groupBy getID
Each one of the grouped Sources can be processed in parallel with just a map, no need for anything fancy. Here is an example of each grouping being summed in an independent stream:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

implicit val actorSystem = ActorSystem()
implicit val mat = ActorMaterializer()
import actorSystem.dispatcher

def getValues(tuple: (String, Int)) = tuple._2

// does not have to be a def, we can re-use the same sink over-and-over
val sumSink = Sink.fold[Int, Int](0)(_ + _)

// a Source of (String, Future[Int])
val sumSource =
  groupedSource map { case (id, src) =>
    id -> { src map getValues runWith sumSink } // calculate sum in independent stream
  }
Now all of the "foo" numbers are being summed in parallel with all of the "bar" numbers.
mapAsync is used when you have an encapsulated function that returns a Future[T] and you're trying to emit a T instead, which is not the case in your question. Further, mapAsync involves waiting for results, which is not reactive...
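For contrast, a minimal illustration of where mapAsync does fit: the stage's function already returns a Future, and you want the completed value emitted downstream (lookupUser here is just a made-up asynchronous call):

import scala.concurrent.Future
import akka.stream.scaladsl.Source

def lookupUser(id: Int): Future[String] =
  Future.successful(s"user-$id") // hypothetical asynchronous lookup

// runs up to 4 lookups at a time, emitting plain Strings downstream in order
val users = Source(1 to 10).mapAsync(parallelism = 4)(lookupUser)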