Akka stream best practice for dynamic Source and Flow controlled by websocket messages - scala

I'm trying to understand what is the best way to implement with akka stream and alpakka the following scenario (simplified):
The frontend opens a websocket connection with the backend
Backend should wait an initialization message with some parameters (for example bootstrapServers, topicName and a transformationMethod that is a string parameter)
Once these informations are in place, backend can start the alpakka consumer to consume from topic topicName from bootstrapServers and applying some transformation to the data based on transformationMethod, pushing these results inside the websocket
Periodically, frontend can send through the websocket messages that changes the transformationMethod field, so that the transformation algorithm of the messages consumed from Kafka can dynamically change, based on the value of transformationMethod provided into the websocket.
I don't understand if it's possible to achieve this on akka stream inside a Graph, especially the dynamic part, both for the initialization of the alpakka consumer and also for the dynamic changing of the transformationMethod parameter.
Example:
Frontend establish connection, and after 10 second it sends trough the socket the following:
{"bootstrapServers": "localhost:9092", "topicName": "topic", "transformationMethod": "PLUS_ONE"}
Because of that, Alpakka consumer is instantiated and starts reading messages from Kafka.
Messages are flowing in Kafka, so it arrives 1 and in the websocket the frontend will receive 2 (because of the PLUS_ONE transformation method, that is probably placed in a map or a via with a Flow), then 2 and so frontend receives 3 and so on.
Then, frontend sends:
{"transformationMethod": "SQUARE"}
So now, from Kafka arrives 3 and the frontend will receive 9, then 4 and so the output will be 16 ecc...
This is more or less the flow of what I would like to obtain.
I am able to create a websocket connection with Alpakka consumer that perform some sort of "static" transformations and push back the result to the websocket, it's straightforward, what I miss is this dynamic part but I'm not sure if i can implement that inside the same graph or if I need more layers (maybe with some Actor that manages the flow and will activate/change the behavior of the Alpakka consumer in real time sending messages?)
Thanks

I would probably tend to implement this by spawning an actor for each websocket, prematerializing a Source which will receive messages from the actor (probably using ActorSource.actorRefWithBackpressure), building a Sink (likely using ActorSink.actorRefWithBackpressure) which adapts incoming websocket messages into control-plane messages (initialization (including the ActorRef associated with the prematerialized source) and transformation changes) and sends them to the actor, and then tying them together using the handleMessagesWithSinkSource on WebsocketUpgrade.
The actor you're spawning would, on receipt of the initialization message, start a stream which is feeding messages to it from Kafka; some backpressure can be fed back to Kafka by having the stream feed messages via an ask protocol which waits for an ack; in order to keep that stream alive, the actor would need to ack within a certain period of time regardless of what the downstream did, so there's a decision to be made around having the actor buffer messages or drop them.

Related

Is it better to keep a Kafka Producer open or to create a new one for each message?

I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback where I am getting the data from RabbitMQ I have a Kafka producer that immediately sends the recevied messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages or should I create the producer inside the callback method and close it after the message is sent, which means that I am creating a new producer for each message?
It might be a naive question but I am new to Kafka and so far I did not find a definitive answer on the internet.
EDIT : I am using a Java Kafka client.
Creating a Kafka producer is an expensive operation, so using Kafka producer as a singleton will be a good practice considering performance and utilizing resources.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka based clients (confluent-dotnet, confluent-python etc.), I can link this related issue with this quote from the issue:
Yes, creating a singleton service like that is a good pattern. you definitely should not create a producer each time you want to produce a message - it is approximately 500,000 times less efficient.
Kafka producer is stateful. It contains meta info(periodical synced from brokers), send message buffer etc. So create producer for each message is impracticable.

Opening Kafka streams dynamically from a queue consumer

We have a use case where based on work item arriving on a worker queue, we would need to use the message metatdata to decide which Kafka topic to stream our data from. We would have maybe less than 100 worker nodes deployed and each worker node can have a configurable number of threads to receive messages from the queue. So if a worker has "n" threads , we would land up opening maybe kafka streams to "n" different topics. (n is usually less than 10).
Once the worker is done processing the message, we would need to close the stream also.
The worker can receive the next messsage once its acked the first message and at which point , I need to open a kafka stream for another topic.
Also every kafka stream needs to scan all the partitions(around 5-10) for the topic to filter by a certain attribute.
Can a flow like this work for Kafka streams or is this not an optimal approach?
I am not sure if I fully understand the use case, but it seem to be a "simple" copy data from topic A to topic B use case, ie, no data processing/modification. The logic to copy data from input to output topic seems complex though, and thus using Kafka Streams (ie, Kafka's stream processing library) might not be the best fit, as you need more flexibility.
However, using plain KafkaConsumers and KafkaProducers should allow you to implement what you want.

Kafka: How to retrieve a response from consumer?

I wish to describe the following scenario:
I have a node.js backend application (It uses a single thread event loop).
This is the general architecture of the system:
Producer -> Kafka -> Consumer -> Database
Let's say that the producer sends a message to Kafka, and the purpose of this message is the make a certain query in database and retrieve the query result.
However, as we all know Kafka is an asynchronous system. If the producer sends a message to Kafka, it gets a response that the message has been accepted by a Kafka broker. Kafka broker doesn't wait until the consumer polls the message and processes it.
In this case, how can the producer get the query result operated on the database?
The flow using Kafka will look like this:
The only way of the Producer A be aware of what happened with the message consumed by the Consumer A is producing another message. Which will be handled accordingly by any other consumer available (in this case, Consumer B).
As you already mentioned, this flow is asynchronous. This can be useful when you have a very heavy processing on your query, like a report generation or something like that, and the second producer will notify an user inbox for example.
If that is not the case, perhaps you should use HTTP, which is synchronous and you will have the response at the end of processing.
You must generate new flow for communicate the query result:
Consumer (now its a producer) -> Kafka topic -> Producer (now its a consumer)
You should consider using another synchronous communication mechanism like HTTP.

Making Kafka producer and Consumer synchronous

I have one kafka producer and consumer.The kafka producer is publishing to one topic and the data is taken and some processing is done. The kafka consumer is reading from another topic about whether the processing of data from topic 1 was successful or not ie topic 2 has success or failure messages.Now Iam starting my consumer and then publishing the data to topic 1 .I want to make the producer and consumer synchronous ie once the producer publishes the data the consumer should read the success or failure message for that data and then the producer should proceed with the next set of data .
Apache Kafka and Publish/Subscribe messaging in general seeks to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the Producer and Consumer are explicitly coupled together. The standard Apache Kafka Producers/Consumer APIs do not support this Message Exchange Pattern but you can always write your own simple wrapper on top of the Kafka API's that uses Correlation IDs, Consumption ACKs, and Request/Response messages to make your own interface that behaves as you wish.
Short Answer : You can't do that, Kafka doesn't provide that support.
Long Answer: As Hans explained, Publish/Subscribe messaging model keeps Publish and subscribe completely unaware of each other and I believe that is where the power of this model lies. Producer can produce without worrying about if there is any consumer and consumer can consume without worrying about how many producers are there.
The closest you can do is, you can make your producer synchronous. Which means you can wait till your message is received and acknowledged by broker.
if you want to do that, flush after every send.

Email notification when Kafka producer and consumer goes down

I have developed a data pipeline using Kafka. Right now I have one type of producer and two types of consumers setup in the cluster.
Producer: gets the message from a windows server
Consumer: Consumer A uses Spark Streaming to transform and present a real time view. Consumer B stores the RAW data, might be useful for building the schema at a later stage.
For various reasons starting from network, the consumers do not receive any data and also it is possible that the consumer process might die in case there is a system failure.
I would be interested in knowing if there is a way to implement something which sends you email notification when the consumer stops receiving messages or the consumer thread dies altogether. Do Kafka or Zookeeper provide a way of doing it?
Right now I am thinking of checking the target system if it is receiving messages or not. But in future if the number of targets increase it will be really complex to write email notification systems for individual targets.