How to define an AmqpSource to subscribe to multiple exchanges? - scala

Right now I am subscribing to a single exchange using
AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(..))
I want to be able to subscribe to multiple exchanges. Can anyone help me with this?

If there is nothing specific for this in a particular Alpakka source, you can use either a Merge or a MergeHub.
If you know all of the sources up front, you can combine multiple Sources into one using a Merge, for example:
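A minimal sketch, assuming two queue names, a local connection provider and buffer sizes that are not part of the original question:

import akka.NotUsed
import akka.stream.alpakka.amqp.scaladsl.AmqpSource
import akka.stream.alpakka.amqp.{AmqpLocalConnectionProvider, NamedQueueSourceSettings, ReadResult}
import akka.stream.scaladsl.{Merge, Source}

val connectionProvider = AmqpLocalConnectionProvider // placeholder; use your real connection provider

val source1 = AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(connectionProvider, "queue1"), bufferSize = 10)
val source2 = AmqpSource.atMostOnceSource(
  NamedQueueSourceSettings(connectionProvider, "queue2"), bufferSize = 10)

// Combine the two sources into a single stream with a Merge fan-in stage
val merged: Source[ReadResult, NotUsed] =
  Source.combine(source1, source2)(Merge(_))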
If you don't know all of the sources up front, you can use a MergeHub, e.g.
// A simple consumer that will print to the console for now
val consumer = Sink.foreach(println)
// Attach a MergeHub Source to the consumer. This will materialize to a
// corresponding Sink.
val runnableGraph: RunnableGraph[Sink[String, NotUsed]] =
MergeHub.source[String](perProducerBufferSize = 16).to(consumer)
// By running/materializing the consumer we get back a Sink, and hence
// now have access to feed elements into it. This Sink can be materialized
// any number of times, and every element that enters the Sink will
// be consumed by our consumer.
val toConsumer: Sink[String, NotUsed] = runnableGraph.run()
// Feeding two independent sources into the hub.
AmqpSource.atMostOnceSource(NamedQueueSourceSettings(..))
  .map(_.bytes.utf8String) // the AMQP source emits ReadResult; extract the payload as a String for the String-typed hub
  .runWith(toConsumer)
AmqpSource.atMostOnceSource(NamedQueueSourceSettings(..))
  .map(_.bytes.utf8String)
  .runWith(toConsumer)

Related

Akka Source that emits when another Sink receives

I have a source a that emits values into a sink b.
Now I want to have another source c that emits a value every time b receives an element.
My idea was to use another sink d as a notifier, but then I would need a way to create a Source from a Sink:
a.to(b).alsoTo(d)
something like
Source.from(d)
Another way of describing this is that you want every event emitted by a to go to both b and c. This is what a BroadcastHub does; it can be used to allow events from one Source to be consumed by multiple Sinks.
If you connect a Source to a BroadcastHub.sink and then materialise it, you get a new Source. This Source can then be attached to two or more Sinks, and each Sink will get a copy of every message emitted by the original Source.
For example, I use this with Akka to have an Actor that broadcasts messages to multiple clients (for gRPC events):
val (actorRef: ActorRef[Event], eventSource: Source[Event, akka.NotUsed]) =
  ActorSource.actorRef[Event](
    completionMatcher = PartialFunction.empty,
    failureMatcher = PartialFunction.empty,
    bufferSize = 16,
    overflowStrategy = OverflowStrategy.fail
  )
  .toMat(BroadcastHub.sink)(Keep.both)
  .run()
This creates eventSource which can be used in a pipeline and materialised multiple times to create multiple streams. Each time a message is sent to the actorRef, every stream that was materialised from eventSource receives that message.
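As a quick illustration (the two consumers below are placeholders, not part of the original code), each materialisation of eventSource gets its own copy of every Event sent to actorRef:

// Two independent consumers; both receive every event published via the actor
eventSource.runWith(Sink.foreach(event => println(s"client A: $event")))
eventSource.runWith(Sink.foreach(event => println(s"client B: $event")))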
See the documentation for more details.

Write multiple topics in Kafka by Flink dynamically Exception handling

I am currently reading data from a single Kafka topic and writing to a dynamic topic based on the data itself. I have implemented the following code (which dynamically selects the topic based on the accountId of the data) and it is working just fine:
class KeyedEnrichableEventSerializationSchema(schemaRegistryUrl: String)
  extends KafkaSerializationSchema[KeyedEnrichableEvent]
  with KafkaContextAware[KeyedEnrichableEvent] {

  private val enrichableEventClass = classOf[EnrichableEvent]
  private val enrichableEventSerialization: AvroSerializationSchema[EnrichableEvent] =
    ConfluentRegistryAvroSerializationSchema.forSpecific(enrichableEventClass, enrichableEventClass.getCanonicalName, schemaRegistryUrl)

  override def serialize(element: KeyedEnrichableEvent, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord("trackingevents." + element.value.getEventMetadata.getAccountId, element.key, enrichableEventSerialization.serialize(element.value))

  override def getTargetTopic(element: KeyedEnrichableEvent): String =
    "trackingevents." + element.value.getEventMetadata.getAccountId
}
The problem/concern is that if the topic does not exist, I get an exception in the JobManager UI about the topic not being present, and the whole processing is halted until I create the topic. Maybe this is the recommended behaviour, but is there an alternative, such as putting the data into a different topic and resuming processing instead of stopping everything? Or, at the very least, is there any notification mechanism in Flink that immediately signals that processing has halted?
I think the simplest solution in this case is to enable auto-creation of topics in Kafka, which solves the problem entirely.
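For reference, topic auto-creation is a broker-side setting (in server.properties):

auto.create.topics.enable=true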
However, if for some reason that is not possible in your case, the simplest solution IMHO would be to create a ProcessFunction that keeps a connection to Kafka via a KafkaConsumer or AdminClient and periodically checks whether the topic needed for a given message exists. You could then either push the data to a side output or try to create the missing topics.
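A rough Scala sketch of that idea, reusing the KeyedEnrichableEvent type from the question; the class name, side-output tag and topic-list caching are my own assumptions, not part of the original answer:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.OutputTag
import org.apache.flink.util.Collector

class TopicExistenceCheck(bootstrapServers: String,
                          missingTopicTag: OutputTag[KeyedEnrichableEvent])
  extends ProcessFunction[KeyedEnrichableEvent, KeyedEnrichableEvent] {

  @transient private var admin: AdminClient = _
  @transient private var knownTopics: Set[String] = Set.empty

  override def open(parameters: Configuration): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    admin = AdminClient.create(props)
  }

  override def processElement(element: KeyedEnrichableEvent,
                              ctx: ProcessFunction[KeyedEnrichableEvent, KeyedEnrichableEvent]#Context,
                              out: Collector[KeyedEnrichableEvent]): Unit = {
    val topic = "trackingevents." + element.value.getEventMetadata.getAccountId
    // Only refresh the cached topic list when an unknown topic shows up,
    // so the AdminClient is not queried for every record.
    if (!knownTopics.contains(topic)) {
      knownTopics = admin.listTopics().names().get().asScala.toSet
    }
    if (knownTopics.contains(topic)) out.collect(element) // topic exists: forward downstream to the Kafka sink
    else ctx.output(missingTopicTag, element)             // topic missing: route to a side output instead of failing the job
  }

  override def close(): Unit = if (admin != null) admin.close()
}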

Kafka Streams Processor API clear state store

I am using the Kafka Streams Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder =
    Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), storeObjectSerde);

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
    .addProcessor("processor", () -> new CustomProcessor("storeName"), "SourceReadername")
    .addStateStore(storeBuilder, "processor") // attach the store to the processor
    .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on an event coming in on a separate topic. I have not been able to find the right way to join with another stream using the Processor API, or some other way to listen to events on another topic, so that I can trigger the cleanup code in the CustomProcessor class.
Is there a way to consume events from another topic with the Processor API? Or perhaps mix the DSL with the Processor API so that I can join the two and route events from either topic to the process method, and run the cleanup code when an event arrives on the cleanup topic?
Thanks
You just need to add another input topic (addSource) and a Processor that handles messages from that topic and, based on them, removes entries from the state store. One note: both topics should use the same keys (because of partitioning).
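A minimal Scala sketch of that suggestion, mirroring the Java topology above and assuming a hypothetical "cleanupTopic" whose record keys are the store keys to delete; the node and class names are made up:

import org.apache.kafka.streams.processor.{Processor, ProcessorContext, ProcessorSupplier}
import org.apache.kafka.streams.state.KeyValueStore

class CleanupProcessor(storeName: String) extends Processor[String, String] {
  private var store: KeyValueStore[String, StoreObject] = _

  override def init(context: ProcessorContext): Unit =
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, StoreObject]]

  // Each record on the cleanup topic carries the key to remove from the store
  override def process(key: String, value: String): Unit =
    store.delete(key)

  override def close(): Unit = ()
}

val cleanupSupplier: ProcessorSupplier[String, String] = () => new CleanupProcessor("storeName")

topology
  .addSource("CleanupSource", stringDeserializer, stringDeserializer, "cleanupTopic")
  .addProcessor("cleanupProcessor", cleanupSupplier, "CleanupSource")
  // give the new processor access to the store that CustomProcessor already uses
  .connectProcessorAndStateStores("cleanupProcessor", "storeName")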

Kafka Streams - Using An Existing State Store After Adding a New Source Stream

I have an existing stream which uses two topics as its source:
val streamsBuilder = new StreamsBuilder
val stream1 = streamsBuilder.stream[K, V]("topic1")
val stream2 = streamsBuilder.stream[K, V]("topic2")
stream1
  .merge(stream2)
  .groupByKey
  .reduce(reduceValues)
  .toStream
  .to("result-topic")
The auto-generated name of the StateStore is KSTREAM-REDUCE-STATE-STORE-0000000003.
Now I need to add one more topic as a source. However, adding a new source increments a Kafka-internal counter, causing the StateStore name to become KSTREAM-REDUCE-STATE-STORE-0000000005. I don't want to lose the current state, so I explicitly provide the name of the old StateStore:
val streamsBuilder = new StreamsBuilder
val stream1 = streamsBuilder.stream[K, V]("topic1")
val stream2 = streamsBuilder.stream[K, V]("topic2")
val stream3 = streamsBuilder.stream[K, V]("topic3") // new topic

stream1
  .merge(stream2)
  .merge(stream3) // merge new topic
  .groupByKey
  .reduce(reduceValues)(Materialized.as("KSTREAM-REDUCE-STATE-STORE-0000000003"))
  .toStream
  .to("result-topic")
It seems to work, but I'm not sure if I'm interfering with the Kafka internals because:
I'm using a custom name in the form of what Kafka would auto-generate (possibility of a name conflict?)
The set of streams used to feed this StateStore is different than what it was initially.
Any comments?
To be honest, the safest option would be to give this state store a human-readable name, but, as you mentioned, you would then lose the current state.
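For reference, the explicitly named variant would look roughly like this (the store name here is only an example):

stream1
  .merge(stream2)
  .merge(stream3)
  .groupByKey
  .reduce(reduceValues)(Materialized.as("my-reduce-store")) // the newly named store (and its changelog topic) starts empty
  .toStream
  .to("result-topic")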
I assume there shouldn't be any problem with what you did (at least until you introduce another change in the code :)). The ID 0000000003 will be assigned to the groupByKey operator, so there won't be any conflicts (although I am not 100% sure about the Kafka Streams internals there).
There is also the Application Reset Tool, which allows you to regenerate the aggregations. But I don't know whether it is applicable to your case: the retention policy on your input topics might prevent the tool from regenerating the exact aggregates.

Sort RDD in Spark before publishing it to Kafka?

In my code, I first subscribe to a Kafka stream, process each RDD to create instances of my class People, and then I want to publish the result set (Dataset[People]) to a specific Kafka topic. It is important to note that not every incoming message received from Kafka maps to an instance of People. Moreover, the instances of People should be sent to Kafka in exactly the same order as they were received from Kafka.
However, I am not sure whether sorting is really necessary, or whether the instances of People maintain the same order when the respective code is run on the executors (in which case I could directly publish my Dataset to Kafka). As far as I understand, sorting is necessary because the code inside foreachRDD can be executed on different nodes in the cluster. Is this correct?
Here's my code:
val myStream = KafkaUtils.createDirectStream[K, V](
  streamingContext, PreferConsistent, Subscribe[K, V](topics, consumerConfig))

def process(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = record match {
  case (rdd, time) if !rdd.isEmpty =>
    // More Code...
    // In the end, I have: Dataset[People]
  case _ =>
}

myStream.foreachRDD((x, y) => process((x, y))) // Do I have to replace this call with map, sort the RDD and then publish it to Kafka?
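For what it's worth, once the Dataset[People] is built inside process, a batch write to Kafka could look roughly like this; the topic name, bootstrap servers and the JSON conversion are assumptions, not part of the original code:

import org.apache.spark.sql.functions.{col, struct, to_json}

// peopleDs: Dataset[People] produced inside `process`
peopleDs
  .select(to_json(struct(col("*"))).alias("value")) // the Kafka sink expects a string/binary "value" column
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "people")
  .save()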
Moreover, instances of people should be sent to Kafka in exactly the same order as received from Kafka.
Unless you have a single partition (and then you wouldn't use Spark, would you?), the order in which data is received is not deterministic, and likewise the order in which data is sent won't be. Sorting doesn't make any difference here.
If you need a very specific processing order (which is typically a design mistake in data-intensive applications), you need a sequential application, or a system with much more granular control than Spark.