Akka Stream dynamic Sink depending on Message from Kafka topic - scala

I have a Kafka consumer that reads Messages. Each Message has an ID and content.
case class Message(id: String, content: String)
Depending on the ID, I want to write the Message to a separate sink, specifically to a MongoDB collection. The Alpakka MongoDB connector provides a Sink that writes documents to the specified collection.
val sink: Sink[Document, Future[Done]] = MongoSink.insertOne(collection(id))
The problem is that I need to specify the sink when connecting the Kafka consumer Source, but each element determines which sink it should go to.
Is there a way to dynamically choose a specific sink when an element arrives? Or is this not possible, and should I, for example, use a different Kafka topic for each ID and connect each source to a separate sink?

It's not totally clear how the types line up in your example (e.g. the relationship between Document and Message), but there are a couple of approaches you can take:
If there are a lot of possible collections and they can't be known in advance, then the least bad option in Akka Streams is going to be along the lines of
Sink.foreachAsync[Message](parallelism) { msg =>
  val document = documentFromMessage(msg)
  val targetCollection = collection(msg.id)  // renamed so the val doesn't shadow collection(...)
  Source.single(document)
    .runWith(MongoSink.insertOne(targetCollection))
    .map(_ => ())  // foreachAsync expects a Future[Unit]; an implicit ExecutionContext is assumed
}
Note that this will materialize a new Mongo sink for every message, which may have efficiency implications. If there's a lighter-weight way (e.g. in the reactivemongo driver?) that returns a Future after inserting a single document, while using something like a connection pool to reduce the overhead of single-document inserts, that will probably be preferable.
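For instance, a sketch along these lines avoids materializing a sub-stream per message, assuming the driver behind collection(id) exposes an insertOne whose result can be converted to a Future (as the mongo-scala-driver does via toFuture(); this is an assumption, not something stated in the question):
// Sketch under the assumption that collection(id) is a mongo-scala-driver
// MongoCollection: insertOne gives an Observable that converts to a Future,
// so no per-message sub-stream is needed. An implicit ExecutionContext is assumed.
Sink.foreachAsync[Message](parallelism) { msg =>
  val document = documentFromMessage(msg)
  collection(msg.id)
    .insertOne(document)
    .toFuture()
    .map(_ => ())  // foreachAsync expects Future[Unit]
}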
If the collections are known beforehand, you can prebuild a sink for each collection and use Partition and the GraphDSL to define a combined sink which incorporates the prebuilt sinks:
// collection0, etc. are predefined and encompass all of the collections
// which might be returned by collection(id)
val collections: Map[MongoCollection[Document], (Int, Sink[Document, Future[Done]])] = Map(
  collection0 -> (0 -> MongoSink.insertOne(collection0)),
  collection1 -> (1 -> MongoSink.insertOne(collection1)),
  collection2 -> (2 -> MongoSink.insertOne(collection2)),
  collection3 -> (3 -> MongoSink.insertOne(collection3))
)
val combinedSink = Sink.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._

  val partition = builder.add(
    Partition[Message](
      collections.size,
      msg => collections(collection(msg.id))._1
    )
  )

  val toDocument = Flow[Message].map(documentFromMessage)

  collections.foreach {
    case (_, (n, sink)) =>
      partition.out(n) ~> toDocument ~> sink
  }

  SinkShape.of(partition.in)
})
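A hedged example of wiring the combined sink to an Alpakka Kafka source (consumerSettings, the topic name and the mapping from the consumer record onto Message are assumptions, not part of the question):
// Hypothetical wiring: consumerSettings, the topic name and the key/value
// mapping onto Message are placeholders.
Consumer
  .plainSource(consumerSettings, Subscriptions.topics("messages"))
  .map(record => Message(record.key, record.value))
  .runWith(combinedSink)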

Related

Running out of memory when loading a large number of records from the database

I am using Slick with Akka Streams to load a large number of records (~2M) from the database (PostgreSQL) and write them to an S3 file. However, I'm noticing that my code below works for around ~50k records but fails for anything over the ~100k mark.
val allResults: Future[Seq[MyEntityImpl]] =
  MyRepository.getAllRecords()

val results: Future[MultipartUploadResult] = Source
  .fromFuture(allResults)
  .map(seek => seek.toList)
  .mapConcat(identity)
  .map(myEntity => myEntity.toPSV + "\n")
  .map(s => ByteString(s))
  .runWith(s3Sink)
Below is a sample of what MyEntityImpl looks like:
case class MyEntityImpl(partOne: MyPartOne, partTwo: MyPartTwo) {
  def toPSV: String = partOne.toPSV + partTwo.toPSV
}
case class MyPartOne(field1: String, field2: String) {
  def toPSV: String = s"$field1|$field2"
}
case class MyPartTwo(field1: String, field2: String) {
  def toPSV: String = s"$field1|$field2"
}
I am looking for a way to do this in a more reactive way so that it doesn't run out of memory.
Underlying Problem
The problem is that you are pulling all of the records from the database into local memory before dispatching them to the s3Sink.
The first place the data is pulled into memory is likely your MyRepository.getAllRecords() method: most, if not all, Seq implementations are in-memory. The second place where you're definitely using local memory is seek.toList, because a List stores all of the data in memory.
Solution
Instead of returning a Seq from getAllRecords, you should return a Slick-backed Akka Streams Source directly. This ensures that your materialized stream only needs memory for the transient processing steps on its way to S3.
If your method definition changes to:
def getAllRecords() : Source[MyEntityImpl, _]
Then the rest of the stream would operate in a reactive manner:
MyRepository
  .getAllRecords()
  .map(myEntity => myEntity.toPSV + "\n")
  .map(ByteString.apply)
  .runWith(s3Sink)
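A minimal sketch of such a Source, assuming plain Slick (the db handle and the myEntities table query are hypothetical names; Alpakka's Slick connector offers an equivalent Slick.source):
// Hypothetical sketch: stream rows straight out of Slick into an Akka Streams
// Source, so only a bounded number of rows is in memory at any time.
import akka.NotUsed
import akka.stream.scaladsl.Source
import slick.jdbc.PostgresProfile.api._

val db: Database = ???                           // your Slick database
val myEntities: TableQuery[MyEntityTable] = ???  // hypothetical table mapping to MyEntityImpl

def getAllRecords(): Source[MyEntityImpl, NotUsed] =
  // db.stream returns a Reactive Streams Publisher; for PostgreSQL you may also need
  // .withStatementParameters(fetchSize = ...) and .transactionally for true streaming.
  Source.fromPublisher(db.stream(myEntities.result))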

Compare each incoming record from an input topic with its preceding record

I am new to Kafka Streams. My use case is to compare the value of each incoming record from an input topic with the value of its preceding record and, if a comparison condition is true, send a new record containing the comparison result and the indices of the compared records to a result topic; otherwise nothing should be sent. (Note that incoming records may each have a unique key, or a null key.)
Doing this with the plain Kafka consumer and producer API is easy, but (without using an external DB to store the preceding record) I am trying to use only the Kafka Streams DSL API, which includes KTable and KStream with methods such as aggregate, reduce, etc. Perhaps because I am a beginner, I have not found a clear way to access an internal state store through this API in order to store the previous record's state, retrieve it to compare with the current record, and then overwrite it with the current record so it can be compared with the next incoming one. Some approaches use the Processor API instead of the Streams DSL API, but it involves much more complexity and I did not fully understand it. That is why I am trying to solve my problem with the Streams DSL API, but so far I have not succeeded.
Can you please help me by providing a detailed code example of how to get this done with the Kafka Streams DSL?
You can use the Processor API.
You have to implement the Transformer interface, whose transform method:
1. Looks up the previous value for the key in the state store:
1.1 if it is not present, puts the new value in the store and emits nothing;
1.2 if it is present, computes the comparison result, saves the new value in the store and passes the result on to the output topic.
Sample code:
object SampleApp extends App {
  val storeName: String = ???

  val builder: StreamsBuilder = new StreamsBuilder()

  // The state store used by the transformer has to be registered with the builder
  builder.addStateStore(
    Stores.keyValueStoreBuilder(
      Stores.persistentKeyValueStore(storeName),
      Serdes.String(),
      Serdes.String()
    )
  )

  builder.stream("topicName")(Consumed.`with`(Serdes.String(), Serdes.String()))
    .transform[String, String](() => SampleTransformer(storeName), storeName)
    .to("outputTopic")(Produced.`with`(Serdes.String(), Serdes.String()))
}

case class SampleTransformer(storeName: String)
    extends Transformer[String, String, KeyValue[String, String]]
    with LazyLogging {

  var store: KeyValueStore[String, String] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, String]]
  }

  override def transform(key: String, newValue: String): KeyValue[String, String] = {
    // Emit a result only if a previous value exists for this key; returning null forwards nothing
    val valueToPass = Option(store.get(key))
      .map(oldValue => KeyValue.pair(key, someComputation(oldValue, newValue)))
      .orNull
    store.put(key, newValue)
    valueToPass
  }

  def someComputation(oldValue: String, newValue: String): String = ???

  override def close(): Unit = {
    // Nothing to clean up
  }
}
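To actually run this topology you still need to build and start a KafkaStreams instance; a minimal sketch (the application id and bootstrap servers are placeholders):
// Minimal start-up sketch; the application id and bootstrap servers are placeholders.
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "compare-with-previous-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val streams = new KafkaStreams(builder.build(), props)
streams.start()
sys.addShutdownHook(streams.close())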

How to create an Akka flow with backpressure and Control

I need to create a function with the following Interface:
import akka.kafka.scaladsl.Consumer.Control
object ItemConversionFlow {
  def build(config: StreamConfig): Flow[Item, OtherItem, Control] = {
    // Implementation goes here
  }
}
My problem is that I don't know how to define the flow in a way that it fits the interface above.
When I do something like this:
val flow = Flow[Item]
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
the resulting type is Flow[Item, OtherItem, NotUsed]. I haven't found anything in the Akka documentation so far, and the functions on akka.stream.scaladsl.Flow only offer NotUsed instead of Control. It would be great if someone could point me in the right direction.
Some background: I need to set up several pipelines which only differ in the conversion part. These pipelines are sub-streams of a main stream which might be stopped for some reason (a corresponding message arrives in some Kafka topic). That is why I need the Control part. The idea is to create a graph template where I just pass in the mentioned flow as an argument (a factory returning it). For a specific case we have a solution that works; to generalize it I need this kind of flow.
You actually already have backpressure. However, think about what you really need backpressure for: you are not using asynchronous stages to increase your throughput, for example. Backpressure prevents fast producers from overwhelming slow subscribers (see https://doc.akka.io/docs/akka/2.5/stream/stream-rate.html). In your sample you don't need to worry about it: your stream will request new elements from the publisher depending on how long doConversion takes to complete.
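A small sketch to illustrate the point (the slow stage stands in for doConversion; an implicit materializer and ExecutionContext are assumed to be in scope): the slow stage only pulls a new element once it has finished the previous one, so a fast source is automatically slowed down with no extra configuration:
// Demand-driven backpressure in action: the slow mapAsync stage limits how fast
// the source emits, so nothing piles up in memory.
import scala.concurrent.Future

Source(1 to 100)
  .mapAsync(parallelism = 1) { i =>
    Future { Thread.sleep(100); i }  // stand-in for a slow doConversion
  }
  .runWith(Sink.foreach(println))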
If you want to obtain the result of the stream, use toMat or viaMat. For example, if your stream emits Item and transforms these into OtherItem:
val str = Source.fromIterator(() => List(Item(Some(1))).toIterator)
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
  .toMat(Sink.fold(List[OtherItem]())((a, b) => {
    // Examine the result of your stream
    b :: a
  }))(Keep.right)
  .run()
str will be Future[List[OtherItem]]. Try to extrapolate this to your case.
Or use viaMat with KillSwitches which, quoting the docs, "creates a new Graph of FlowShape that materializes to an external switch that allows external completion of that unique materialization. Different materializations result in different, independent switches."
def build(config: StreamConfig): Flow[Item, OtherItem, UniqueKillSwitch] = {
  Flow[Item]
    .map(item => doConversion(item))
    .filter(_.isDefined)
    .map(_.get)
    .viaMat(KillSwitches.single)(Keep.right)
}
val stream =
  Source.fromIterator(() => List(Item(Some(1))).toIterator)
    .viaMat(build(StreamConfig(1)))(Keep.right)
    .toMat(Sink.ignore)(Keep.both)
    .run()

// This stops the stream
stream._1.shutdown()

// When it finishes
stream._2 onComplete (_ => println("Done"))

What are good models to introduce concurrency for database inserts using actor model

More details:
I am new to Scala and Akka.
I am trying to build a concurrent system that essentially does this:
Read a CSV file
Parse it into groups
And then load into table.
The file cannot be split into smaller files, so I am going with a normal, serialized read. I pass the info to a MasterWriter (an actor). I dynamically create n writer actors and pass them chunks of this info. Each writer is then responsible for reading its chunk of the data, categorizing it, and inserting it into the appropriate table.
My doubt is: when two writers write to the table concurrently, will it lead to a race condition? Also, how else could this problem be modeled to increase speed? Any help in any direction would be really useful. Thanks.
Modelling the Data Access
I have found that the biggest key to designing this sort of task is to abstract away the database. You should treat any database update as a simple function that returns success or failure:
type UpdateResult = Boolean

val UpdateSuccess : UpdateResult = true
val UpdateFailure : UpdateResult = false

type Data = Seq[String]  // placeholder: substitute your parsed row type
type Updater = (Data) => UpdateResult
This allows you to write an Updater that goes to an actual db, or a test Updater that always returns success:
val statement : Statement = ???

val dbUpdater : Updater = (data) => {
  // executeUpdate returns the number of affected rows
  statement.executeUpdate(s"INSERT INTO ... ${data.toString}") > 0
}

val testUpdater : Updater = _ => UpdateSuccess
Akka Stream Implementation
For this particular use case I recommend akka streams instead of raw Actors. A solution using the stream paradigm can be found here.
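A minimal sketch of such a stream-based pipeline (the file name, parseLine and the parallelism value are assumptions):
// Hypothetical sketch: stream a CSV file line by line, parse each line into
// Data, and run the Updater with bounded parallelism so the database is never
// overwhelmed by concurrent inserts.
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("csv-loader")
implicit val mat: ActorMaterializer = ActorMaterializer()
import system.dispatcher

def parseLine(line: String): Data = ???  // your CSV parsing / grouping logic

val done: Future[akka.Done] =
  FileIO.fromPath(Paths.get("input.csv"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 8192, allowTruncation = true))
    .map(_.utf8String)
    .map(parseLine)
    .mapAsync(parallelism = 4)(data => Future(dbUpdater(data)))
    .runWith(Sink.ignore)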
Akka Actor
An Actor solution is also possible:
class UpdateActor(updater : Updater) extends Actor {
  override def receive = {
    case data : Data => sender() ! updater(data)
  }
}
The problem with Actors is that you'll have to write an Actor to read the file, other Actors to group the rows, and finally use the UpdateActor to send data to the db. You'll also have to wire all of those Actors together...

Akka Stream - Splitting flow into multiple Sources

I have a TCP connection in Akka Stream that ends in a Sink. Right now all messages go into one Sink. I want to split the stream into an unknown number of Sinks given some function.
The use case is as follows: from the TCP connection I get a continuous stream of something like List[DeltaValue], and I want to create an actor sink for each DeltaValue.id so that I can continuously accumulate state and implement behaviour per DeltaValue.id. I find this to be a standard use case in stream processing, but I'm not able to find a good example with Akka Streams.
This is what I have right now:
def connect(): ActorRef = tcpConnection
  // SOMEHOW SPLIT HERE and create a ReceiverActor for each message
  .to(Sink.actorRef(system.actorOf(ReceiverActor.props(), ReceiverActor.name), akka.Done))
  .run()
Update:
I now have this. Not sure what to say about it - it does not feel super stable, but it should work:
private def spawnActorOrSendMessage(m: ResponseMessage): Unit = {
  implicit val timeout = Timeout(FiniteDuration(1, TimeUnit.SECONDS))

  system.actorSelection("user/" + m.id.toString).resolveOne().onComplete {
    case Success(actorRef) => actorRef ! m
    case Failure(ex)       => system.actorOf(ReceiverActor.props(), m.id.toString) ! m
  }
}

def connect(): ActorRef = tcpConnection
  .to(Sink.foreachParallel(10)(spawnActorOrSendMessage))
  .run()
The below should be a somewhat improved version of what was updated in the question. The main improvement is that your actors are kept in a data structure to avoid actorSelection resolution for every incoming message.
case class DeltaValue(id: String, value: Double)
val src: Source[DeltaValue, NotUsed] = ???
src.runFold(Map[String, ActorRef]()) {
  case (actors, elem) if actors.contains(elem.id) ⇒
    actors(elem.id) ! elem.value
    actors
  case (actors, elem) ⇒
    // actor names must be unique, so derive the name from the element's id
    val newActor = system.actorOf(ReceiverActor.props(), s"${ReceiverActor.name}-${elem.id}")
    newActor ! elem.value
    actors.updated(elem.id, newActor)
}
Keep in mind that, when you integrate Akka Streams with bare actors, you lose backpressure support. This is one of the reasons why you should try and implement your logic within the boundaries of Akka Streams whenever possible. And this is not always possible - e.g. when remoting is needed etc.
In your case, you could consider leveraging groupBy and the concept of substreams. The example below folds the elements of each substream by summing them, just to give an idea:
src.groupBy(maxSubstreams = Int.MaxValue, f = _.id)
  .fold("" → 0d) {
    case ((_, acc), delta) ⇒ delta.id → (delta.value + acc)
  }
  .mergeSubstreams
  .runForeach(println)
EventStream
You can send messages to the ActorSystem's EventStream within a stream sink and separately have the Actors subscribe to the stream.
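A rough sketch of that approach (DeltaValue and the per-id accumulation are placeholders; system is the ActorSystem already used above):
// Publish every element to the system's EventStream from a sink, and let
// interested actors subscribe to the message class they care about.
val publishSink = Sink.foreach[DeltaValue](dv => system.eventStream.publish(dv))

class DeltaSubscriber extends Actor {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[DeltaValue])

  override def receive: Receive = {
    case dv: DeltaValue => // accumulate per-id state here
  }
}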
Split At Stream Level
You can split the stream at the stream level using Broadcast. The documentation has a good example of this.
Split At Actor Level
You could also use Sink.actorRef in combination with a BroadcastPool to broadcast the messages to multiple Actors.
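A hedged sketch of that combination (the pool size and Worker behaviour are assumptions; system is the ActorSystem already in scope):
// Route every element to a pool of worker actors via a BroadcastPool router
// sitting behind a single Sink.actorRef.
import akka.actor.{Actor, Props}
import akka.routing.BroadcastPool
import akka.stream.scaladsl.{Sink, Source}

class Worker extends Actor {
  override def receive: Receive = {
    case msg => // handle / accumulate per-message behaviour here
  }
}

val router = system.actorOf(BroadcastPool(5).props(Props[Worker]), "workers")

Source(1 to 100)
  .runWith(Sink.actorRef(router, onCompleteMessage = akka.Done))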