Akka Source that emits when another Sink receives - scala

I have a source a that emits values into a sink b.
Now I want to have another source c that emits a value every time b receives an event.
My idea was to use another sink d that can be used as a notifier, but then I need a way to create a Source from a Sink:
a.alsoTo(d).to(b)
something like
Source.from(d)

Another way of describing this is that you want every event emitted by a to go to both b and c. This is what a BroadcastHub does; it can be used to allow events from one Source to be consumed by multiple Sinks.
If you connect a Source to a BroadcastHub.sink and then materialise it, you get a new Source. This Source can then be attached to 2 or more Sinks, and each Sink will get a copy of every message emitted by the original Source.
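Applied to the names in the question, a minimal sketch could look like the following (the Int element type, the throttled demo source, and the buffer size are just assumptions for illustration):
import akka.actor.ActorSystem
import akka.stream.scaladsl.{BroadcastHub, Keep, Sink, Source}
import akka.{Done, NotUsed}
import scala.concurrent.Future
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("example")

// Stand-ins for the question's `a` and `b`; the Int element type is arbitrary.
val a: Source[Int, NotUsed]    = Source(1 to 100).throttle(1, 100.millis)
val b: Sink[Int, Future[Done]] = Sink.foreach[Int](x => println(s"b got $x"))

// Run `a` into a BroadcastHub; the materialized value is a new Source that can be
// attached to any number of sinks, each of which sees every element from `a`.
val hubSource: Source[Int, NotUsed] =
  a.toMat(BroadcastHub.sink(bufferSize = 16))(Keep.right).run()

hubSource.runWith(b)                    // b still receives every element
val c: Source[Int, NotUsed] = hubSource // and c emits a value every time b gets one
c.runWith(Sink.foreach(x => println(s"c got $x")))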
For example, I use this with Akka to have an Actor that broadcasts messages to multiple clients (for gRPC events):
val (actorRef: ActorRef[Event], eventSource: Source[Event, akka.NotUsed]) =
  ActorSource
    .actorRef[Event](
      completionMatcher = PartialFunction.empty,
      failureMatcher = PartialFunction.empty,
      bufferSize = 16,
      overflowStrategy = OverflowStrategy.fail
    )
    .toMat(BroadcastHub.sink)(Keep.both)
    .run()
This creates eventSource, which can be used in a pipeline and materialised multiple times to create multiple streams. Each time a message is sent to the actorRef, every stream that was materialised from eventSource receives that message.
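For illustration, a sketch of how the materialised pair might then be used (someEvent is a placeholder for a value of the Event type, not part of the original code):
// Two independent materializations; each gets its own copy of every event.
val client1 = eventSource.runWith(Sink.foreach(e => println(s"client 1 saw $e")))
val client2 = eventSource.runWith(Sink.foreach(e => println(s"client 2 saw $e")))

// Any event sent to the actor now reaches both streams.
actorRef ! someEvent // someEvent: any value of the Event type (placeholder)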
See the documentation for more details.

Related

Akka stream best practice for dynamic Source and Flow controlled by websocket messages

I'm trying to understand the best way to implement the following (simplified) scenario with Akka Streams and Alpakka:
The frontend opens a websocket connection with the backend
The backend should wait for an initialization message with some parameters (for example bootstrapServers, topicName, and a transformationMethod that is a string parameter)
Once this information is in place, the backend can start the Alpakka consumer to consume from topic topicName on bootstrapServers, apply some transformation to the data based on transformationMethod, and push the results into the websocket
Periodically, the frontend can send messages through the websocket that change the transformationMethod field, so that the transformation applied to the messages consumed from Kafka can change dynamically, based on the value of transformationMethod provided over the websocket.
I don't know whether it's possible to achieve this with Akka Streams inside a single graph, especially the dynamic part: both the initialization of the Alpakka consumer and the dynamic changing of the transformationMethod parameter.
Example:
The frontend establishes the connection, and after 10 seconds it sends the following through the socket:
{"bootstrapServers": "localhost:9092", "topicName": "topic", "transformationMethod": "PLUS_ONE"}
Because of that, the Alpakka consumer is instantiated and starts reading messages from Kafka.
Messages are flowing in Kafka, so when 1 arrives the frontend receives 2 over the websocket (because of the PLUS_ONE transformation method, probably applied in a map or a via with a Flow), then 2 arrives and the frontend receives 3, and so on.
Then, frontend sends:
{"transformationMethod": "SQUARE"}
So now, when 3 arrives from Kafka the frontend receives 9, then 4 arrives and the output is 16, etc.
This is more or less the flow of what I would like to obtain.
I am able to create a websocket connection with an Alpakka consumer that performs some sort of "static" transformation and pushes the result back to the websocket; that part is straightforward. What I'm missing is the dynamic part, and I'm not sure whether I can implement it inside the same graph or whether I need more layers (maybe some Actor that manages the flow and activates/changes the behaviour of the Alpakka consumer in real time by sending messages?)
Thanks
I would probably tend to implement this by spawning an actor for each websocket and prematerializing a Source which will receive messages from that actor (probably using ActorSource.actorRefWithBackpressure). I would then build a Sink (likely using ActorSink.actorRefWithBackpressure) which adapts incoming websocket messages into control-plane messages (initialization, including the ActorRef associated with the prematerialized source, and transformation changes) and sends them to the actor, and finally tie the two together using handleMessagesWithSinkSource on WebSocketUpgrade.
The actor you're spawning would, on receipt of the initialization message, start a stream which feeds it messages from Kafka. Some backpressure can be fed back to Kafka by having that stream deliver messages via an ask protocol which waits for an ack. To keep that stream alive, the actor would need to ack within a certain period of time regardless of what the downstream did, so there's a decision to be made around having the actor buffer messages or drop them.
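As a rough, untested sketch of just the transformation-switching part: the protocol, the toFunction helper, and all names below are made up, and the Kafka and websocket wiring described above is omitted.
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object TransformationControl {
  // Hypothetical control-plane protocol; all names here are illustrative.
  sealed trait Command
  final case class Init(bootstrapServers: String, topic: String, method: String) extends Command
  final case class SetTransformation(method: String) extends Command
  final case class KafkaRecord(value: Int, replyTo: ActorRef[Ack.type]) extends Command
  case object Ack

  // Waiting for the initialization message from the websocket.
  def idle(out: ActorRef[Int]): Behavior[Command] =
    Behaviors.receiveMessage {
      case Init(_, _, method) =>
        // Here the actor would start the Alpakka consumer, which feeds it KafkaRecord
        // messages via an ask that waits for Ack (Kafka wiring omitted in this sketch).
        running(out, toFunction(method))
      case _ =>
        Behaviors.same
    }

  // Consuming: transform each record and push it to the prematerialized websocket source.
  def running(out: ActorRef[Int], transform: Int => Int): Behavior[Command] =
    Behaviors.receiveMessage {
      case SetTransformation(method) =>
        running(out, toFunction(method)) // swap the transformation at runtime
      case KafkaRecord(value, replyTo) =>
        out ! transform(value) // e.g. PLUS_ONE turns 1 into 2, SQUARE turns 3 into 9
        replyTo ! Ack          // ack so the Kafka stream keeps flowing
        Behaviors.same
      case _ =>
        Behaviors.same
    }

  private def toFunction(method: String): Int => Int = method match {
    case "PLUS_ONE" => _ + 1
    case "SQUARE"   => x => x * x
    case _          => identity
  }
}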

Kafka Streams Processor API clear state store

I am using the Kafka Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder =
        Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), storeObjectSerde);

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addProcessor("processor", () -> new CustomProcessor("storeName"), "SourceReadername")
        .addStateStore(storeBuilder, "processor") // make the store available to the processor
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on an event coming in on a separate topic. I am not able to find the right way to join with another stream using the Processor API, or some other way to listen to events on another topic, so that I can trigger the cleanup code in the CustomProcessor class.
Is there a way to get events from another topic in the Processor API? Or perhaps mix the DSL with the Processor API to join the two, so that events from either topic reach the process method and I can run the cleanup code when an event is received on the cleanup topic?
Thanks
You just need to add another input topic (addSource) and a Processor that handles messages from that topic and, based on them, removes entries from the state store. One note: both topics should use the same keys (because of partitioning).
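A possible sketch of that wiring, written in Scala against the same Topology API (cleanupTopic, the node names, the CleanupEvent type, and cleanupSerde are illustrative assumptions; it reuses topology, stringDeserializer, and StoreObject from the question and assumes the older ProcessorContext-based Processor API):
import org.apache.kafka.streams.processor.{AbstractProcessor, ProcessorContext, ProcessorSupplier}
import org.apache.kafka.streams.state.KeyValueStore

// A processor that deletes entries from the shared store when a cleanup event arrives.
// CleanupEvent stands in for whatever value type the cleanup topic carries.
class CleanupProcessor(storeName: String) extends AbstractProcessor[String, CleanupEvent] {
  private var store: KeyValueStore[String, StoreObject] = _

  override def init(context: ProcessorContext): Unit = {
    super.init(context)
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, StoreObject]]
  }

  override def process(key: String, value: CleanupEvent): Unit =
    store.delete(key) // remove the entry this cleanup event refers to
}

val cleanupSupplier: ProcessorSupplier[String, CleanupEvent] =
  () => new CleanupProcessor("storeName")

// Wire a second source into the existing topology and give the new processor
// access to the same state store (the name must match the store supplier's name).
topology
  .addSource("cleanupSource", stringDeserializer, cleanupSerde.deserializer(), "cleanupTopic")
  .addProcessor("cleanupProcessor", cleanupSupplier, "cleanupSource")
  .connectProcessorAndStateStores("cleanupProcessor", "storeName")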

Using Single SQS for multiple subscribers based on message identifier

We have an application with multiple subscribers: data written to a publisher Kafka topic is propagated to subscriber-specific topics, and each subscriber consumes the data from the specific topic assigned to it.
We want to use SQS for the same purpose, but the issue is that we would again need a queue for each subscriber.
Handling multiple queues creates overhead, and whenever no data is published for a subscriber, the queue assigned to it sits idle.
Is there any way I can use a single SQS queue from which all subscribers can consume messages based on a message identifier?
Challenges that need to be covered in this design:
Each subscriber can get its messages based on the identifier
Latency must not suffer when one publisher publishes very few messages while another publishes millions.
We can have one queue for each publisher, but a single queue for all subscribers of that publisher.
Can anyone suggest an architecture for a similar implementation?
Thanks
I think you can achieve this by setting up a single SQS queue. You would want to set up a Lambda trigger on that queue which will serve as a Service Manager (SM). The SM will have a static JSON file that defines the mapping between a message identifier and its subscriber/worker. The SM will receive an SQS message event, find the message attribute used as the identifier, and then look it up in the JSON to find the corresponding subscriber. If a subscriber is found, the SM will invoke it.
Consider using SQS batch trigger.
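A rough sketch of such a Service Manager Lambda in Scala (the "identifier" attribute name, the mapping, and the worker names are made up, and the actual Lambda invocation call is elided):
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.SQSEvent
import scala.jdk.CollectionConverters._

// Routes each SQS record to a subscriber/worker based on a message attribute.
class ServiceManager extends RequestHandler[SQSEvent, Unit] {

  // Static mapping from message identifier to worker; in practice this would be
  // loaded from the JSON file mentioned above.
  private val subscribers: Map[String, String] = Map(
    "subscriber-a" -> "worker-lambda-a",
    "subscriber-b" -> "worker-lambda-b"
  )

  override def handleRequest(event: SQSEvent, context: Context): Unit =
    event.getRecords.asScala.foreach { record =>
      val id = Option(record.getMessageAttributes.get("identifier")).map(_.getStringValue)
      id.flatMap(subscribers.get) match {
        case Some(worker) => invoke(worker, record.getBody, context)
        case None         => context.getLogger.log(s"No subscriber found for identifier $id")
      }
    }

  // Invocation elided: this would call the Lambda Invoke API (asynchronously,
  // e.g. invocation type "Event") with the record body as the payload.
  private def invoke(functionName: String, payload: String, context: Context): Unit =
    context.getLogger.log(s"Would invoke $functionName with $payload")
}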

How to define AmqpSource to subscribe to multiple exchange?

Right now I am subscribing to a single exchange using
AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(..))
I want to be able to subscribe to multiple exchanges. Can anyone help me with this?
If there's nothing specific for this in a particular Alpakka source, you can use either a Merge or a MergeHub.
If you know all of the sources up front, you can combine multiple Sources into one using a Merge.
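For example, a minimal sketch assuming Alpakka AMQP 2.x (where the element type is ReadResult), with a placeholder connection provider and queue names that would be bound to the exchanges:
import akka.NotUsed
import akka.stream.alpakka.amqp.scaladsl.AmqpSource
import akka.stream.alpakka.amqp.{AmqpLocalConnectionProvider, NamedQueueSourceSettings, ReadResult}
import akka.stream.scaladsl.{Merge, Sink, Source}

// Placeholder connection and queue names; each queue is bound to one of the exchanges.
val connectionProvider = AmqpLocalConnectionProvider

val fromQueue1: Source[ReadResult, NotUsed] =
  AmqpSource.atMostOnceSource(NamedQueueSourceSettings(connectionProvider, "queue-1"), bufferSize = 10)
val fromQueue2: Source[ReadResult, NotUsed] =
  AmqpSource.atMostOnceSource(NamedQueueSourceSettings(connectionProvider, "queue-2"), bufferSize = 10)

// Source.combine merges two or more sources into one using the given fan-in stage.
val merged: Source[ReadResult, NotUsed] = Source.combine(fromQueue1, fromQueue2)(Merge(_))

merged.runWith(Sink.foreach(println))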
If you don't know all of the sources up front, you can use a MergeHub, e.g.
// A simple consumer that will print to the console for now
val consumer = Sink.foreach(println)
// Attach a MergeHub Source to the consumer. This will materialize to a
// corresponding Sink.
val runnableGraph: RunnableGraph[Sink[String, NotUsed]] =
  MergeHub.source[String](perProducerBufferSize = 16).to(consumer)
// By running/materializing the consumer we get back a Sink, and hence
// now have access to feed elements into it. This Sink can be materialized
// any number of times, and every element that enters the Sink will
// be consumed by our consumer.
val toConsumer: Sink[String, NotUsed] = runnableGraph.run()
// Feeding two independent sources into the hub.
AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(..)).runWith(toConsumer)

AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(..)).runWith(toConsumer)

custom Flume interceptor: intercept() method called multiple times for the same Event

TL;DR
When a Flume source fails to push a transaction to the next channel in the pipeline, does it always keep event instances for the next try?
In general, is it safe to have a stateful Flume interceptor, where processing of events depends on previously processed events?
Full problem description:
I am considering the possibility of leveraging guarantees offered by Apache Kafka regarding the way topic partitions are distributed among consumers in a consumer group to perform streaming deduplication in an existing Flume-based log consolidation architecture.
Using the Kafka Source for Flume and custom routing to Kafka topic partitions, I can ensure that every event that should go to the same logical "deduplication queue" will be processed by a single Flume agent in the cluster (for as long as there are no agent stops/starts within the cluster). I have the following setup using a custom-made Flume interceptor:
[KafkaSource with deduplication interceptor]-->(MemoryChannel)-->[HDFSSink]
It seems that when the Flume Kafka source runner is unable to push a batch of events to the memory channel, the event instances that are part of the batch are passed again to my interceptor's intercept() method. In this case, it was easy to add a tag (in the form of a Flume event header) to processed events to distinguish actual duplicates from events in a failed batch that got re-processed.
However, I would like to know if there is any explicit guarantee that Event instances in failed transactions are kept for the next try or if there is the possibility that events are read again from the actual source (in this case, Kafka) and re-built from zero. In that case, my interceptor will consider those events to be duplicates and discard them, even though they were never delivered to the channel.
EDIT
This is how my interceptor distinguishes an Event instance that was already processed from a non-processed event:
public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    // tagHeaderName is the name of the header used to tag events, never null
    if (!tagHeaderName.isEmpty()) {
        // Don't look further if event was already processed...
        if (headers.get(tagHeaderName) != null) {
            return event;
        }
        // Mark it as processed otherwise...
        else {
            headers.put(tagHeaderName, "");
        }
    }
    // Continue processing of event...
    return event;
}
I encountered a similar issue:
When a sink write fails, the Kafka Source still holds the data that has already been processed by the interceptors. On the next attempt, that data is sent to the interceptors and gets processed again and again. From reading the KafkaSource code, I believe it's a bug.
My interceptor strips some information from the original message and modifies it. Because of this bug, the retry mechanism will never work as expected.
So far, there is no easy solution.