I am trying to better understand Akka Streams concepts with the following example. Consider a bank account. It has a past transaction history, and new transactions will keep arriving. Now we want to use it as a source for an Akka stream, but its data will be used in 3 different scenarios:
1. A consumer app collects all past transactions and prints a report.
2. A consumer app is a transaction monitor that prints all new transactions, starting from the time the app started.
3. A consumer app combines the functions of (1) and (2): it first prints all past transactions and then prints all arriving transactions.
What do we have here in terms of Akka streams? Is the difference in the stream sources, which feed otherwise identical flows and sinks with different data? Or is the source the same (it's all transactions from the same bank account), and we need to apply different filtering operations to obtain the different results?
Akka Stream Sources can be combined much like any other Iterable in Scala.
Based on your example, say we have historic transactions persisted in a database. We could use something like Slick streaming to get those transactions from the db:
val historicSource : Source[Transaction, _] = ???
There would also be realtime transactions (possibly coming from a messaging system):
val realtimeSource : Source[Transaction, _] = ???
These two Sources can be combined:
val combinedSource = historicSource ++ realtimeSource
The combined events can then be consumed by the same stream-processing logic; for example, you could println any transaction over $1,000.00:
val isLargeTransaction = (_ : Transaction).dollarAmount > 1000.0

val reportTransaction = (transaction : Transaction) =>
  println(s"Large Transaction: $transaction")

combinedSource.filter(isLargeTransaction)
              .runWith(Sink foreach reportTransaction)
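To map this back to the three scenarios in the question: the flows and sinks can stay the same, and only the source changes. A minimal sketch, reusing the historicSource, realtimeSource and reportTransaction values from above (and assuming an implicit materializer is in scope):

// (1) Report over past transactions only
historicSource.runWith(Sink foreach reportTransaction)

// (2) Monitor of new transactions only, from the moment the app starts
realtimeSource.runWith(Sink foreach reportTransaction)

// (3) Past transactions first, then all arriving ones, via source concatenation
(historicSource ++ realtimeSource).runWith(Sink foreach reportTransaction)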
We have the following logic implemented to manage jobs targeting different backends:
A Manager Actor is started. This actor:
Loads the configuration required to target each backend (a mutable map of backend name -> backend connector configuration);
Creates a pool of Actors (RoundRobinPool) to handle the jobs for each backend (a mutable map of backend name -> RoundRobinPool ActorRef).
When a request is received by the Manager actor, it retrieves the backend name from the message and forwards it to the corresponding pool of Actors to handle the job (assuming a configuration for this backend was registered). The result of the job request is then returned from the actor to the original sender (which is why we use forward).
This logic works very well, but since the backends are slow to handle jobs, we are in a typical fast-publisher/slow-consumer situation, and this raises issues when the load increases.
After doing some research, Akka Streams seems the way to go, as it allows us to implement back pressure and throttling, which would be perfect for our usage (for example, limiting to 5 requests per second).
The idea is to keep the Manager Actor with the same routing logic but replace the pools of Actors with a Source.queue.
When registering the Source.queue, this would be performed like this:
val queue = Source
  .queue[RunBackendRequest](0, OverflowStrategy.backpressure)
  .throttle(5, 1.second)
  .map(r => runBackendRequest(r))
  .toMat(Sink.ignore)(Keep.left)
  .run()
Where the definition of RunBackendRequest is:
case class RunBackendRequest(originalSender: ActorRef, backendConnector: BackendConnector, request: BackendRequest)
And the function runBackendRequest is defined as such:
private def runBackendRequest(runRequest: RunBackendRequest): Unit = {
  val connector = BackendConnectorFactory.getBackendConnector(configuration.underlying,
    runRequest.backendConnector.toConfig(), materializer, environment.asJava)
  // Run the job asynchronously and reply to the original sender with the outcome
  Future { connector.doSomeWork(runRequest.request) } map { result =>
    runRequest.originalSender ! Success(result)
  } recover {
    case e: Exception => runRequest.originalSender ! Failure(e)
  }
}
When the Manager Actor receives a message, it will 'offer' it to the correct queue based on the name of the target backend contained in the message.
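For illustration, a minimal sketch of what that offer could look like inside the Manager actor's receive block; the queues and connectors maps and the backendName field on the message are assumptions about your types, not your actual API:

// Assumed: queues is a Map[String, SourceQueueWithComplete[RunBackendRequest]] and
// connectors is a Map[String, BackendConnector], both populated at registration time
case request: BackendRequest =>
  queues.get(request.backendName) match {
    case Some(queue) =>
      // offer returns a Future[QueueOfferResult]; with OverflowStrategy.backpressure
      // it completes only once the queue has accepted the element
      queue.offer(RunBackendRequest(sender(), connectors(request.backendName), request))
    case None =>
      sender() ! Failure(new IllegalArgumentException(s"Unknown backend: ${request.backendName}"))
  }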
Therefore, I have a few questions:
Is this the correct way to use Akka Streams in this particular use case, or could it be written differently and more efficiently?
Is it OK to provide the ActorRef of the original sender in the RunBackendRequest object so that the request can be answered from within the Flow?
Is there a way to retrieve the result of the Flow as a Future instead, so that the Manager actor could then return the result of the request itself?
Akka Streams seems to be very powerful, but there is clearly a learning curve!
It feels to me that having the Manager Actor creates a single point of failure. Maybe worth a try:
The original sender keeps hammering an Akka stream graph instead of the Manager actor. Make sure you pass the ActorRef downstream so that the reply can be sent back.
Inside the graph, use either partition-then-merge or Substreams to process requests that target different backend connectors (see the sketch after this list).
Either as the last step of the graph or after the backend connectors have finished, answer the original sender.
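A minimal sketch of the partition-then-merge idea, assuming a fixed set of registered backend names, a placeholder callBackend function and a BackendResult type, and assuming BackendConnector exposes a name (all of these are illustrative, not your actual API):

import akka.NotUsed
import akka.stream.FlowShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, Partition}
import scala.concurrent.duration._

val backendNames = List("backendA", "backendB") // assumed set of registered backends

val routedFlow: Flow[RunBackendRequest, BackendResult, NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    // Route each request to the branch that handles its backend
    val partition = b.add(Partition[RunBackendRequest](
      backendNames.size, req => backendNames.indexOf(req.backendConnector.name)))
    val merge = b.add(Merge[BackendResult](backendNames.size))

    backendNames.zipWithIndex.foreach { case (name, i) =>
      // Each backend gets its own throttle, so one slow backend does not stall the others
      val branch = Flow[RunBackendRequest]
        .throttle(5, 1.second)
        .map(req => callBackend(name, req)) // placeholder for the actual backend call
      partition.out(i) ~> branch ~> merge.in(i)
    }

    FlowShape(partition.in, merge.out)
  })

The last stage of such a graph (or a Sink attached after it) can then send the result back to the originalSender carried in the request, which covers the third point above.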
Overall, Colin's article is a great introduction on how to use Akka Streams with Partition and Merge to achieve your goal.
Let me know if you need more clarification and I can update my answer accordingly.
I have a mix-and-match Scala topology where the main worker is a PAPI processor, and other parts are connected through DSL.
EventsProcessor:
INPUT: eventsTopic
OUTPUT: visitorsTopic (and others)
Data throughout the topics (including the original eventsTopic) is partitioned by a key type, let's call it DoubleKey, that has two fields.
Visitors are sent to visitorsTopic through a Sink:
.addSink(VISITOR_SINK_NAME, visitorTopicName,
DoubleKey.getSerializer(), Visitor.getSerializer(), visitorSinkPartitioner, EVENT_PROCESSOR_NAME)
In the DSL, I create a KV KTable over this topic:
val visitorTable = builder.table(
visitorTopicName,
Consumed.`with`(DoubleKey.getKafkaSerde(),
Visitor.getKafkaSerde()),
Materialized.as(visitorStoreName))
which I later connect to the EventProcessor:
topology.connectProcessorAndStateStores(EVENT_PROCESSOR_NAME, visitorStoreName)
Everything is co-partitioned (via DoubleKey). visitorSinkPartitioner performs a typical modulo operation:
Math.abs(partitionKey.hashCode % numPartitions)
In the PAPI processor EventsProcessor, I query this table to see if there are existent Visitors already.
However, in my tests (using EmbeddedKafka, but that should not make a difference), if I run them with one partition, all is fine (the EventsProcessor checks the KTable for two events with the same DoubleKey, and on the second event, with some delay, it can see the existing Visitor in the store), but if I run them with a higher number of partitions, the EventsProcessor never sees the value in the store.
However, if I check the store via the API (iterating store.all()), the record is there, so I understand it must be going to a different partition.
Since the KTable should work on the data in its partition, and everything is sent to the same partition (using explicit partitioners calling the same code), the KTable should see that data on the same partition.
Are my assumptions correct? What could be happening?
KafkaStreams 1.0.0, Scala 2.12.4.
PS: Of course it would work to do the puts from the PAPI, creating the store through the PAPI instead of StreamsBuilder.table(), since that would definitely use the same partition where the code runs, but that's out of the question.
Yes, the assumptions were correct.
In case it helps anyone:
I had a problem when passing the Partitioner to the Scala EmbeddedKafka library. In one of the test suites it was not done right.
Now, following the ever-healthy practice of refactoring, I have this method used in all the test suites of this topology.
def getEmbeddedKafkaTestConfig(zkPort: Int, kafkaPort: Int): EmbeddedKafkaConfig = {
  val producerProperties = Map(ProducerConfig.PARTITIONER_CLASS_CONFIG ->
    classOf[DoubleKeyPartitioner].getCanonicalName)
  EmbeddedKafkaConfig(kafkaPort = kafkaPort, zooKeeperPort = zkPort,
    customProducerProperties = producerProperties)
}
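For completeness, the DoubleKeyPartitioner referenced in that config is a regular Kafka producer Partitioner; a minimal sketch of what it might look like (this is an assumption, simply mirroring the modulo logic from the question):

import java.util
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class DoubleKeyPartitioner extends Partitioner {
  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionsForTopic(topic).size
    // Same modulo logic as the visitorSinkPartitioner used in the topology
    Math.abs(key.hashCode % numPartitions)
  }
  override def close(): Unit = ()
  override def configure(configs: util.Map[String, _]): Unit = ()
}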
An app that I am developing requires giving users the ability to create and define arbitrary streams at runtime. I understand that in Akka Streams in particular
Materialisation = Execute or Run
My questions
1) Should materialisation of a stream be done only once? I.e., if it is already materialised, can I use the value for subsequent runs?
2) As said above, maybe I have misunderstood the term materialisation. If a stream has to run, is it materialised each time?
I am confused because the docs say materialisation actually creates the resources needed for stream execution, so my immediate understanding is that it has to be done only once, just like a JDBC connection to a database. Can someone please explain in non-Akka terminology?
Yes, a stream can be materialized multiple times. And yes, if a stream is run multiple times, it is materialized each time. From the documentation:
Since a stream can be materialized multiple times, the materialized value will also be calculated anew for each such materialization, usually leading to different values being returned each time. In the example below we create two running materialized instances of the stream that we described in the runnable variable, and both materializations give us a different Future from the map even though we used the same sink to refer to the future:
// connect the Source to the Sink, obtaining a RunnableGraph
val sink = Sink.fold[Int, Int](0)(_ + _)
val runnable: RunnableGraph[Future[Int]] =
Source(1 to 10).toMat(sink)(Keep.right)
// get the materialized value of the FoldSink
val sum1: Future[Int] = runnable.run()
val sum2: Future[Int] = runnable.run()
// sum1 and sum2 are different Futures!
Think of a stream as a reusable blueprint that can be run/materialized multiple times. A materializer is required to materialize a stream, and Akka Streams provides a materializer called ActorMaterializer. The materializer allocates the necessary resources (actors, etc.) and executes the stream. While it is common to use the same materializer for different streams and multiple materializations, each materialization of a stream triggers the resource allocation needed to run the stream. In the example above, sum1 and sum2 use the same blueprint (runnable) and the same materializer, but they are the results of distinct materializations that incurred distinct resource allocations.
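To make the distinction concrete, a minimal sketch (the system name and the numbers are arbitrary):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.Future

implicit val system = ActorSystem("example")
implicit val materializer = ActorMaterializer() // created once, reused for every run

// The blueprint: nothing runs yet and no stream resources are allocated
val blueprint = Source(1 to 3).toMat(Sink.seq)(Keep.right)

// Each run() is a separate materialization: new stream actors, new materialized value
val first: Future[Seq[Int]] = blueprint.run()
val second: Future[Seq[Int]] = blueprint.run()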
Monix looks like a great framework, but the documentation is very sparse.
What is the Monix analogue of Akka Streams' alsoTo?
Basically I want a stream to be consumed by two consumers.
Monix follows the Rx model in that subscriptions are dynamic. Any Observable supports an unlimited number of subscribers:
val obs = Observable.interval(1.second)
val s1 = obs.dump("O1").subscribe()
val s2 = obs.dump("O2").subscribe()
There is a catch, however: Observable is by default what is called a "cold data source", meaning that each subscriber gets its own data source.
So for example, if you had an Observable that reads from a File, then each subscriber would get its own file handle.
In order to "share" such an Observable between multiple subscribers, you have to convert it into a hot data source, to share it. You do so with the multicast operator and its versions, publish being most commonly used. These give you back a ConnectableObservable, that needs a connect() call to start the streaming:
val shared = obs.publish
// Nothing happens here:
val s1 = shared.dump("O1").subscribe()
val s2 = shared.dump("O2").subscribe()
// Starts actual streaming
val cancelable = shared.connect()
// You can subscribe after connect(), but you might lose events:
val s3 = shared.dump("O3").subscribe()
// You can unsubscribe one of your subscribers, but the
// data source keeps the stream active for the others
s1.cancel()
// To cancel the connection for all subscribers:
cancelable.cancel()
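If you prefer not to manage connect() and cancel() by hand, the ConnectableObservable returned by publish also has refCount, which connects when the first subscriber arrives and disconnects once the last one cancels. A minimal sketch:

import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable
import scala.concurrent.duration._

val obs = Observable.interval(1.second)

// publish makes the source hot; refCount handles connect/disconnect automatically
val shared = obs.publish.refCount

val s1 = shared.dump("O1").subscribe()
val s2 = shared.dump("O2").subscribe()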
PS: monix.io is a work in progress, PRs are welcome 😀
I have code that executes a pipeline using Akka Streams.
My question is: what is the best way to scale it out? Can it be done with Akka Streams as well?
Or does it need to be converted into actors or some other approach?
The code snippet is:
val future = SqsSource(sqsEndpoint)(awsSqsClient)
.takeWhile(_=>true)
.map { m: Message =>
(m, Ack())
}.runWith(SqsAckSink(sqsEndpoint)(awsSqsClient))
If you modify your code a bit then your stream will be materialized into multiple Actor values. These materialized Actors will get you the concurrency you are looking for:
val future =
SqsSource(sqsEndpoint)(awsSqsClient) //Actor 1
.via(Flow[Message] map (m => (m, Ack()))) //Actor 2
.to(SqsAckSink(sqsEndpoint)(awsSqsClient)) //Actor 3
.run()
Note the use of via and to. These are important because they indicate that those stages of the stream should be materialized into separate Actors. In your example code you are using map and runWith on the Source, which would result in only 1 Actor being created because of operator fusion.
Flows that Ask External Actors
If you're looking to extend to even more Actors then you can use Flow#mapAsync to query an external Actor to do more work, similar to this example.
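A minimal sketch of that idea, assuming a hypothetical workerRef actor that does the extra work and replies when done (the parallelism and timeout values are arbitrary):

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

implicit val timeout: Timeout = 3.seconds

val workerRef: ActorRef = ??? // hypothetical actor doing the additional processing

val future =
  SqsSource(sqsEndpoint)(awsSqsClient)
    .via(Flow[Message].mapAsync(parallelism = 4) { m =>
      // Ask the worker; once it replies, acknowledge the original SQS message
      (workerRef ? m).map(_ => (m, Ack()))
    })
    .to(SqsAckSink(sqsEndpoint)(awsSqsClient))
    .run()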