I'm trying to build an example of using the Stream.concurrently method in fs2. I'm developing the producer/consumer pattern, using a Queue as the shared state:
import cats.effect.{ExitCode, IO, IOApp}
import cats.effect.std.{Queue, Random}
import fs2.Stream
import scala.concurrent.duration._

object Fs2Tutorial extends IOApp {

  val random: IO[Random[IO]] = Random.scalaUtilRandom[IO]
  val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10)

  val producer: IO[Nothing] = for {
    r <- random
    q <- queue
    p <- r.betweenInt(1, 11)
           .flatMap(q.offer)
           .flatTap(_ => IO.sleep(1.second))
           .foreverM
  } yield p

  val consumer: IO[Nothing] = for {
    q <- queue
    c <- q.take.flatMap { n =>
           IO.println(s"Consumed $n")
         }.foreverM
  } yield c

  val concurrently: Stream[IO, Nothing] =
    Stream.eval(producer).concurrently(Stream.eval(consumer))

  override def run(args: List[String]): IO[ExitCode] = {
    concurrently.compile.drain.as(ExitCode.Success)
  }
}
I expect the program to print some "Consumed n", for some n. However, the program prints nothing to the console.
What's wrong with the above code?
You are not using the same Queue in the consumer and in the producer; rather, each of them creates its own new, independent Queue (the same happens with Random, by the way).
This is a common mistake made by newcomers who don't yet grasp the main principles behind a data type like IO.
When you write val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10) you are saying that queue is a program that, when evaluated, will produce a value of type Queue[IO, Int]; that is the whole point.
The program becomes a value, and like any value you can manipulate it in many ways to produce new values, for example using flatMap. So when both the consumer and the producer create a new program by flatMapping queue, they each create a new, independent program / value.
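As a quick illustration of that point, here is a minimal sketch (cats-effect 3, separate from the fix below) showing that flatMapping the same queue program twice yields two unrelated queues:

import cats.effect.IO
import cats.effect.std.Queue
import cats.effect.unsafe.implicits.global

val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10)

val program: IO[Option[Int]] =
  for {
    q1 <- queue      // first evaluation: creates queue #1
    q2 <- queue      // second evaluation: creates a brand new queue #2
    _  <- q1.offer(42)
    n  <- q2.tryTake // queue #2 never saw the offer, so this is None
  } yield n

// program.unsafeRunSync() == None: the two evaluations never share state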
You can fix that code like this:
import cats.effect.{IO, IOApp}
import cats.effect.std.{Queue, Random}
import cats.syntax.all._
import fs2.Stream
import scala.concurrent.duration._
object Fs2Tutorial extends IOApp.Simple {
  override final val run: IO[Unit] = {
    val resources =
      (
        Random.scalaUtilRandom[IO],
        Queue.bounded[IO, Int](10)
      ).tupled

    val concurrently =
      Stream.eval(resources).flatMap {
        case (random, queue) =>
          val producer =
            Stream
              .fixedDelay[IO](1.second)
              .evalMap(_ => random.betweenInt(1, 11))
              .evalMap(queue.offer)

          val consumer =
            Stream.fromQueueUnterminated(queue).evalMap(n => IO.println(s"Consumed $n"))

          producer.concurrently(consumer)
      }

    concurrently.interruptAfter(10.seconds).compile.drain >> IO.println("Finished!")
  }
}
PS: I would recommend looking into the "Programs as Values" series by Fabio Labella: https://systemfw.org/archive.html
import org.slf4j.LoggerFactory
import zio.blocking.Blocking
import zio.clock.Clock
import zio.console.{Console, putStrLn}
import zio.kafka.consumer.{CommittableRecord, Consumer, ConsumerSettings, Subscription}
import zio.kafka.consumer.Consumer.{AutoOffsetStrategy, OffsetRetrieval}
import zio.kafka.serde.Serde
import zio.stream.ZStream
import zio.{ExitCode, Has, URIO, ZIO, ZLayer}
object Test2Topics extends zio.App {

  val logger = LoggerFactory.getLogger(this.getClass)

  val consumerSettings: ConsumerSettings =
    ConsumerSettings(List("localhost:9092"))
      .withGroupId(s"consumer-${java.util.UUID.randomUUID().toString}")
      .withOffsetRetrieval(OffsetRetrieval.Auto(AutoOffsetStrategy.Earliest))

  val consumer: ZLayer[Clock with Blocking, Throwable, Has[Consumer]] =
    ZLayer.fromManaged(Consumer.make(consumerSettings))

  val streamString: ZStream[Any with Has[Consumer], Throwable, CommittableRecord[String, String]] =
    Consumer.subscribeAnd(Subscription.topics("test"))
      .plainStream(Serde.string, Serde.string)

  val streamInt: ZStream[Any with Has[Consumer], Throwable, CommittableRecord[String, String]] =
    Consumer.subscribeAnd(Subscription.topics("topic"))
      .plainStream(Serde.string, Serde.string)

  val combined = streamString.zipWithLatest(streamInt)((a, b) => (a, b))

  val program = for {
    fiber1 <- streamInt.tap(r => putStrLn(s"streamInt: ${r.toString}")).runDrain.forkDaemon
    fiber2 <- streamString.tap(r => putStrLn(s"streamString: ${r.toString}")).runDrain.forkDaemon
  } yield ZIO.raceAll(fiber1.join, List(fiber2.join))

  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    //combined.tap(r => putStrLn(s"Combined: ${r.toString}")).runDrain.provideSomeLayer(consumer ++ Console.live).exitCode
    program.provideSomeLayer(consumer ++ Console.live).exitCode
  }
}
Somehow, when I try to combine the output from the two topics (named test and topic), I don't get any output printed out. Printing both streams in parallel doesn't work either, but if I print just one stream at a time it works.
Has anyone experienced anything like this?
You are composing one shared layer that provides a single Consumer instance, and then initializing that instance twice, one after the other, to subscribe to the two topics.
A single consumer instance should only be initialized once, so the above code will never work.
I believe setting up two independent compositions of consumer to stream like this will help (a fuller sketch follows the snippet):
val program = for {
  fiber1 <- streamInt.tap(r => putStrLn(s"streamInt: ${r.toString}")).runDrain.forkDaemon.provideSomeLayer(consumer)
  fiber2 <- streamString.tap(r => putStrLn(s"streamString: ${r.toString}")).runDrain.forkDaemon.provideSomeLayer(consumer)
} yield {...}
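For completeness, here is a rough, unverified sketch of the same idea written out in full. The explicit environment type parameter on provideSomeLayer and the final zipPar are my assumptions (ZIO 1.x / zio-kafka as in the question), not part of the original answer:

// Sketch only: because `consumer` is a layer *description*, every effect it is
// provided to builds its own fresh Consumer instance.
val drainInt: ZIO[zio.ZEnv, Throwable, Unit] =
  streamInt
    .tap(r => putStrLn(s"streamInt: ${r.toString}"))
    .runDrain
    .provideSomeLayer[zio.ZEnv](consumer)   // Consumer instance #1

val drainString: ZIO[zio.ZEnv, Throwable, Unit] =
  streamString
    .tap(r => putStrLn(s"streamString: ${r.toString}"))
    .runDrain
    .provideSomeLayer[zio.ZEnv](consumer)   // Consumer instance #2

// run both in parallel; each subscription now has its own consumer
val both: ZIO[zio.ZEnv, Throwable, Unit] = drainInt.zipPar(drainString).unit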
I am quite new to Akka Streams, whereas I have some experience with Kafka Streams.
One thing that seems to be lacking in Akka Streams is the ability to join two different streams together.
Kafka Streams allows joining information coming from two different streams (or tables) using the messages' keys.
Is there something similar in Akka Streams?
The short answer is, unfortunately, no. I would argue that Akka Streams is lower level than Kafka Streams, Spark Streaming, or Flink. However, you have more control over what you are doing; in particular, it means that you can build your own join operator. Check this discussion at Lightbend.
Basically, you have to get data from the two Sources, merge them, send them into a window based on time or number of tuples, compute the join, and emit the data to the Sink. I have done this PoC (which is still unfinished), but it follows the operators I just described, and it compiles and works. I still have to join the data inside the window; currently, I am just emitting the elements as a mini-batch (see the sketch after the code).
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{Attributes, ClosedShape, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.collection.mutable
import scala.concurrent.duration._
object StreamOpenGraphJoin {

  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("StreamOpenGraphJoin")

    val incrementSource: Source[Int, NotUsed] = Source(1 to 10).throttle(1, 1 second)
    val decrementSource: Source[Int, NotUsed] = Source(10 to 20).throttle(1, 1 second)

    def tokenizerSource(key: Int) = {
      Flow[Int].map { value =>
        (key, value)
      }
    }

    // Step 1 - setting up the fundamental for a stream graph
    val switchJoinStrategies = RunnableGraph.fromGraph(
      GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._

        // Step 2 - add partition and merge strategy
        val tokenizerShape00 = builder.add(tokenizerSource(0))
        val tokenizerShape01 = builder.add(tokenizerSource(1))
        val mergeTupleShape = builder.add(Merge[(Int, Int)](2))
        val batchFlow = Flow.fromGraph(new BatchTimerFlow[(Int, Int)](5 seconds))
        val sinkShape = builder.add(Sink.foreach[(Int, Int)](x => println(s" > sink: $x")))

        // Step 3 - tying up the components
        incrementSource ~> tokenizerShape00 ~> mergeTupleShape.in(0)
        decrementSource ~> tokenizerShape01 ~> mergeTupleShape.in(1)
        mergeTupleShape.out ~> batchFlow ~> sinkShape

        // Step 4 - return the shape
        ClosedShape
      }
    )

    // run the graph and materialize it
    val graph = switchJoinStrategies.run()
  }

  // step 0: define the shape
  class BatchTimerFlow[T](silencePeriod: FiniteDuration) extends GraphStage[FlowShape[T, T]] {
    // step 1: define the ports and the component-specific members
    val in = Inlet[T]("BatchTimerFlow.in")
    val out = Outlet[T]("BatchTimerFlow.out")

    // step 3: create the logic
    override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
      // mutable state
      val batch = new mutable.Queue[T]
      var open = false

      // step 4: define mutable state, implement my logic here
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          try {
            val nextElement = grab(in)
            batch.enqueue(nextElement)
            Thread.sleep(50) // simulate an expensive computation
            if (open) pull(in) // send demand upstream signal, asking for another element
            else {
              // forward the element to the downstream operator
              emitMultiple(out, batch.dequeueAll(_ => true).to[collection.immutable.Iterable])
              open = true
              scheduleOnce(None, silencePeriod)
            }
          } catch {
            case e: Throwable => failStage(e)
          }
        }
      })

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })

      override protected def onTimer(timerKey: Any): Unit = {
        open = false
      }
    }

    // step 2: construct a new shape
    override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
  }
}
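Regarding the part that is still missing ("join the data inside the window"): the mini-batch buffered by BatchTimerFlow is just a sequence of tagged tuples, so one simple, hypothetical way (not part of the PoC above) to turn it into a key-based join is a groupBy on the tag:

// Sketch only: joinWindow is a name I made up; it turns one mini-batch of
// (key, value) tuples into a Map from key to all values seen for that key.
def joinWindow(window: Seq[(Int, Int)]): Map[Int, Seq[Int]] =
  window.groupBy { case (key, _) => key }
        .map { case (key, kvs) => key -> kvs.map(_._2) }

// joinWindow(Seq((0, 1), (1, 10), (0, 2))) == Map(0 -> Seq(1, 2), 1 -> Seq(10))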
I am trying to understand how I should work with Source.queue and Sink.queue in Akka Streams.
In the little test program that I wrote below, I find that I am able to successfully offer 1000 items to the Source.queue.
However, when I wait on the future that should give me the results of pulling all those items off the queue, my future never completes. Specifically, the message 'print what we pulled off the queue' that we should see at the end never prints out; instead we see the error "TimeoutException: Futures timed out after [10 seconds]".
Any guidance is greatly appreciated!
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes}
import org.scalatest.FunSuite
import scala.collection.immutable
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class StreamSpec extends FunSuite {

  implicit val actorSystem: ActorSystem = ActorSystem()
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
  implicit val ec: ExecutionContext = actorSystem.dispatcher

  case class Req(name: String)

  case class Response(
      httpVersion: String = "",
      method: String = "",
      url: String = "",
      headers: Map[String, String] = Map())

  test("put items on queue then take them off") {
    val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
    val flow = Flow[String].map(element => s"Modified $element")
    val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(128, 128))

    val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()

    (1 to 1000).map { i =>
      Future {
        println("offered " + i) // I see this print 1000 times as expected
        sourceQueue.offer(s"batch-$i")
      }
    }
    println("DONE OFFER FUTURE FIRING")

    // Now use the Sink.queue to pull the items we added onto the Source.queue
    val seqOfFutures: immutable.Seq[Future[Option[String]]] =
      (1 to 1000).map { i => sinkQueue.pull() }

    val futureOfSeq: Future[immutable.Seq[Option[String]]] =
      Future.sequence(seqOfFutures)

    val seq: immutable.Seq[Option[String]] =
      Await.result(futureOfSeq, 10.second)

    // unfortunately our future times out here
    println("print what we pulled off the queue:" + seq)
  }
}
Looking at this again, I realize that I originally set up and posed my question incorrectly. The test that accompanies my original question launches a wave of 1000 futures, each of which tries to offer 1 item to the queue. Then the second step in that test attempts to create a 1000-element sequence (seqOfFutures) where each future is trying to pull a value from the queue.
My theory as to why I was getting time-out errors is that there was some kind of deadlock due to running out of threads, or due to one thread waiting on another where the waited-on thread was blocked, or something like that. I'm not interested in hunting down the exact cause at this point because I have corrected things in the code below (see CORRECTED CODE).
In the new code the test that uses the queue is called "put items on queue then take them off (with async parallelism) - (3)". In this test I have a set of 10 tasks which run in parallel to do the 'enqueue' operation. Then I have another 10 tasks which do the dequeue operation, which involves not only taking the item off the list, but also calling stringModifyFunc, which introduces a 1 ms processing delay.
I also wanted to prove that I got some performance benefit from launching tasks in parallel and having the task steps communicate by passing their results through a queue, so test 3 runs as a timed operation, and I found that it takes 1.9 seconds.
Tests (1) and (2) do the same amount of work, but serially: the first with no intervening queue, and the second using the queue to pass results between steps. These tests run in 13.6 and 15.6 seconds respectively (which shows that the queue adds a bit of overhead, but that this is overshadowed by the efficiencies of running tasks in parallel).
CORRECTED CODE
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes, QueueOfferResult}
import org.scalatest.FunSuite
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class Speco extends FunSuite {

  implicit val actorSystem: ActorSystem = ActorSystem()
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
  implicit val ec: ExecutionContext = actorSystem.dispatcher

  val stringModifyFunc: String => String = element => {
    Thread.sleep(1)
    s"Modified $element"
  }

  def setup = {
    val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
    val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(128, 128))
    val (sourceQueue, sinkQueue) = source.toMat(sink)(Keep.both).run()
    val offers: Source[String, NotUsed] = Source(
      (1 to iterations).map { i =>
        s"item-$i"
      }
    )
    (sourceQueue, sinkQueue, offers)
  }

  val outer = 10
  val inner = 1000
  val iterations = outer * inner

  def timedOperation[T](block: => T) = {
    val t0 = System.nanoTime()
    val result: T = block // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) / (1000 * 1000) + " milliseconds")
    result
  }

  test("20k iterations in single threaded loop no queue (1)") {
    timedOperation {
      (1 to iterations).foreach { i =>
        val str = stringModifyFunc(s"tag-${i.toString}")
        System.out.println("str:" + str)
      }
    }
  }

  test("20k iterations in single threaded loop with queue (2)") {
    timedOperation {
      val (sourceQueue, sinkQueue, offers) = setup
      val resultFuture: Future[Done] = offers.runForeach { str =>
        val itemFuture = for {
          _ <- sourceQueue.offer(str)
          item <- sinkQueue.pull()
        } yield (stringModifyFunc(item.getOrElse("failed")))
        val item = Await.result(itemFuture, 10.second)
        System.out.println("item:" + item)
      }
      val result = Await.result(resultFuture, 20.second)
      System.out.println("result:" + result)
    }
  }

  test("put items on queue then take them off (with async parallelism) - (3)") {
    timedOperation {
      val (sourceQueue, sinkQueue, offers) = setup

      def enqueue(str: String) = sourceQueue.offer(str)

      def dequeue = {
        sinkQueue.pull().map { maybeStr =>
          val str = stringModifyFunc(maybeStr.getOrElse("failed2"))
          println(s"dequeued value is $str")
        }
      }

      val offerResults: Source[QueueOfferResult, NotUsed] =
        offers.mapAsyncUnordered(10) { string => enqueue(string) }

      val dequeueResults: Source[Unit, NotUsed] =
        offerResults.mapAsyncUnordered(10) { _ => dequeue }

      val runAll: Future[Done] = dequeueResults.runForeach(u => u)

      Await.result(runAll, 20.second)
    }
  }
}
Apologies in advance for the basic question. I am starting to learn Scala with http4s, and in a route handler I am trying to insert an entry into MongoDB. As far as I can tell, insertOne returns an Observable[Completed].
Any idea how I can wait for the observable to complete before returning the response?
My code is:
class Routes {
  val service: HttpService = HttpService {
    case r @ GET -> Root / "hello" => {
      val mongoClient: MongoClient = MongoClient()
      val database: MongoDatabase = mongoClient.getDatabase("scala")
      val collection: MongoCollection[Document] = database.getCollection("tests")
      val doc: Document = Document("_id" -> 0, "name" -> "MongoDB", "type" -> "database",
        "count" -> 1, "info" -> Document("x" -> 203, "y" -> 102))
      collection.insertOne(doc)
      mongoClient.close()
      Ok("Hello.")
    }
  }
}

class GomadApp(host: String, port: Int) {
  private val pool = Executors.newCachedThreadPool()

  println(s"Starting server on '$host:$port'")

  val routes = new Routes().service

  // Add some logging to the service
  val service: HttpService = routes.local { req =>
    val path = req.uri
    val start = System.nanoTime()
    val result = req
    val time = ((System.nanoTime() - start) / 1000) / 1000.0
    println(s"${req.remoteAddr.getOrElse("null")} -> ${req.method}: $path in $time ms")
    result
  }

  // Construct the blaze pipeline.
  def build(): ServerBuilder =
    BlazeBuilder
      .bindHttp(port, host)
      .mountService(service)
      .withServiceExecutor(pool)
}

object GomadApp extends ServerApp {
  val ip = "127.0.0.1"
  val port = envOrNone("HTTP_PORT") map (_.toInt) getOrElse (8787)

  override def server(args: List[String]): Task[Server] =
    new GomadApp(ip, port)
      .build()
      .start
}
I'd recommend https://github.com/haghard/mongo-query-streams - although you'll have to fork it and bump its dependencies a bit, since scalaz 7.1 and 7.2 aren't binary-compatible.
The less streamy (and less referentially correct) way: https://github.com/Verizon/delorean
collection.insertOne(doc).toFuture().toTask.flatMap({res => Ok("Hello")})
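For illustration, a hedged sketch of how that one-liner could sit inside the original route; it reuses the question's Routes code, assumes delorean's implicit Future-to-Task syntax is in scope, and is not verified against a specific http4s version:

// Sketch only: assumes `import delorean._` for the Future -> Task conversion.
val service: HttpService = HttpService {
  case GET -> Root / "hello" =>
    collection
      .insertOne(doc)
      .toFuture()               // mongo-scala-driver: Observable -> Future
      .toTask                   // delorean: Future -> scalaz Task
      .flatMap(_ => Ok("Hello."))
}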
The latter solution looks easier, but it has some hidden pitfalls. See https://www.reddit.com/r/scala/comments/3zofjl/why_is_future_totally_unusable/
This tweet made me wonder: https://twitter.com/timperrett/status/684584581048233984
Do you consider Futures "totally unusable" or is this just hyperbole? I've never had a major problem, but I'm willing to be enlightened. Doesn't the following code make Futures effectively "lazy"? def myFuture = Future { 42 }
And, finally, I've also heard rumblings that scalaz's Tasks have some failings as well, but I haven't found much on it. Anybody have more details?
Answer:
The fundamental problem is that constructing a Future with a side-effecting expression is itself a side-effect. You can only reason about Future for pure computations, which unfortunately is not how they are commonly used. Here is a demonstration of this breaking referential transparency:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random
val f1 = {
  val r = new Random(0L)
  val x = Future(r.nextInt)
  for {
    a <- x
    b <- x
  } yield (a, b)
}

// Same as f1, but I inlined `x`
val f2 = {
  val r = new Random(0L)
  for {
    a <- Future(r.nextInt)
    b <- Future(r.nextInt)
  } yield (a, b)
}

f1.onComplete(println) // Success((-1155484576,-1155484576))
f2.onComplete(println) // Success((-1155484576,-723955400)) <-- not the same
However this works fine with Task. Note that the interesting one is the non-inlined version, which manages to produce two distinct Int values. This is the important bit: Task has a constructor that captures side-effects as values, and Future does not.
import scalaz.concurrent.Task
val task1 = {
  val r = new Random(0L)
  val x = Task.delay(r.nextInt)
  for {
    a <- x
    b <- x
  } yield (a, b)
}

// Same as task1, but I inlined `x`
val task2 = {
  val r = new Random(0L)
  for {
    a <- Task.delay(r.nextInt)
    b <- Task.delay(r.nextInt)
  } yield (a, b)
}

println(task1.run) // (-1155484576,-723955400)
println(task2.run) // (-1155484576,-723955400)
Most of the commonly-cited differences like "a Task doesn't run until you ask it to" and "you can compose the same Task over and over" trace back to this fundamental distinction.
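A small illustration of those two differences (a sketch, reusing the Task and Future imports above): re-running the same Task value re-executes its effect, while a Future runs once when constructed and only ever caches its result.

val t = Task.delay { println("running"); 42 }
println(t.run) // prints "running", then 42
println(t.run) // prints "running" again: the same Task value can be re-run

val f = Future { println("running"); 42 } // prints "running" once, right here
f.foreach(println) // prints 42 (possibly asynchronously); the body is never re-executed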
So the reason it's "totally unusable" is that once you're used to programming with pure values and relying on equational reasoning to understand and manipulate programs, it's hard to go back to the side-effecty world where things are much harder to understand.
I want to read multiple big files using Akka Streams and process each line. Imagine that each line consists of an (identifier -> value) pair. If a new identifier is found, I want to save it and its value in the database; otherwise, if the identifier has already been found while processing the stream of lines, I want to save only the value. For that, I think I need some kind of recursive stateful flow that keeps the identifiers that have already been found in a Map. I think this flow would receive a pair of (newLine, contextWithIdentifiers).
I've just started to look into Akka Streams. I guess I can manage the stateless processing on my own, but I have no clue about how to keep the contextWithIdentifiers. I'd appreciate any pointers in the right direction.
Maybe something like statefulMapConcat can help you:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.util.Random._
import scala.math.abs
import scala.concurrent.ExecutionContext.Implicits.global

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

//encapsulating your input
case class IdentValue(id: Int, value: String)

//some random generated input
val identValues = List.fill(20)(IdentValue(abs(nextInt()) % 5, "valueHere"))

val stateFlow = Flow[IdentValue].statefulMapConcat { () =>
  //state with already processed ids
  var ids = Set.empty[Int]

  identValue =>
    if (ids.contains(identValue.id)) {
      //save value to DB
      println(identValue.value)
      List(identValue)
    } else {
      //save both to database
      println(identValue)
      ids = ids + identValue.id
      List(identValue)
    }
}

Source(identValues)
  .via(stateFlow)
  .runWith(Sink.seq)
  .onSuccess { case identValue => println(identValue) }
A few years later, here is an implementation I wrote if you only need a 1-to-1 mapping (not 1-to-N):
import akka.stream.stage.{GraphStage, GraphStageLogic}
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
object StatefulMap {
  def apply[T, O](converter: => T => O) = new StatefulMap[T, O](converter)
}

class StatefulMap[T, O](converter: => T => O) extends GraphStage[FlowShape[T, O]] {
  val in = Inlet[T]("StatefulMap.in")
  val out = Outlet[O]("StatefulMap.out")
  val shape = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
    val f = converter
    setHandler(in, () => push(out, f(grab(in))))
    setHandler(out, () => pull(in))
  }
}
Test (and demo):
behavior of "StatefulMap"

class Counter extends (Any => Int) {
  var count = 0

  override def apply(x: Any): Int = {
    count += 1
    count
  }
}

it should "not share state among substreams" in {
  val result = await {
    Source(0 until 10)
      .groupBy(2, _ % 2)
      .via(StatefulMap(new Counter()))
      .fold(Seq.empty[Int])(_ :+ _)
      .mergeSubstreams
      .runWith(Sink.seq)
  }
  result.foreach(_ should be(1 to 5))
}