Kafka Streams - Processor API - Forward to different topics - apache-kafka

I have a Processor API Processor that internally forwards to several separate sinks (think of an event classifier, although it also has stateful logic between events). I was thinking of having a join later between two of those topics. Once a join is made, I forward an updated (enriched) version of the elements to the topics I'm actually joining.
How would you mix in the DSL if your Processor API code forwards to more than one sink (sink1, sink2), each of which in turn writes to its own topic?
I guess you could create separate streams, like
val stream1 = builder.stream(outputTopic)
val stream2 = builder.stream(outputTopic2)
and build from there? However, this creates more sub-topologies - what are the implications here?
Another possibility is to have your own state store in the Processor API and manage it there, in the same Processor (I'm actually doing that). It adds complexity to the code, but wouldn't it be more efficient? For example, you can delete data you no longer use (once a join is made, you forward the newly joined data to the sinks and it is no longer eligible for a join). Any other efficiency gotchas?

The simplest way might be to mix the Processor API with the DSL by starting with a StreamsBuilder and using transform():
StreamsBuilder builder = new StreamsBuilder();
KStream[] streams = builder.stream("input-topic")
    .transform(/* put your Processor API code here */)
    .branch(...);
KStream joined = streams[0].join(streams[1], ...);
Writing the intermediate streams into topics first and reading them back is also possible. The fact that you get more sub-topologies should be of no concern.
Doing the join manually via state stores is possible but hard to code correctly. If possible, I would recommend using the join operator provided by the DSL.
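To make the suggestion a bit more concrete, here is a hedged sketch of that transform/branch/join shape. Everything named below (ClassifierTransformer, the Classified and Enriched value types, the predicates, the store and topic names, and the one-hour window) is an invented placeholder, and default Serdes are assumed via the Streams config:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
// assumes "classifier-store" was registered beforehand via builder.addStateStore(...)

// transform() wraps the existing Processor API logic (state store access, punctuation, ...);
// branch() then splits its output with one predicate per former sink.
KStream<String, Classified>[] branches = builder
        .<String, String>stream("input-topic")
        .<String, Classified>transform(() -> new ClassifierTransformer(), "classifier-store")
        .branch((key, value) -> value.isTypeA(),
                (key, value) -> value.isTypeB());

// Join the two branches with the DSL instead of a hand-rolled join over your own state store.
KStream<String, Enriched> joined = branches[0].join(
        branches[1],
        (a, b) -> new Enriched(a, b),
        JoinWindows.of(Duration.ofHours(1)));

joined.to("enriched-topic");

On newer Kafka Streams versions, branch() and transform() are superseded by split() and process(), but the overall shape of the topology stays the same.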

Related

Apache Flink - Partitioning the stream the same way as the input Kafka topic

I would like to implement in Apache Flink the following scenario:
Given a Kafka topic with 4 partitions, I would like to process the intra-partition data independently in Flink, using different logic depending on the event's type.
In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event has a different structure: partition 1 has the field "a" as key, partition 2 has the field "b" as key, etc. In Flink I would like to apply different business logic depending on the event, so I thought I should split the stream in some way. To achieve what's described in the picture, I thought of doing something like this using just one consumer (I don't see why I should use more):
FlinkKafkaConsumer<..> consumer = ...
DataStream<..> stream = flinkEnv.addSource(consumer);
stream.keyBy("a").map(new AEventMapper()).addSink(...);
stream.keyBy("b").map(new BEventMapper()).addSink(...);
stream.keyBy("c").map(new CEventMapper()).addSink(...);
stream.keyBy("d").map(new DEventMapper()).addSink(...);
(a) Is this correct? Also, I would like to process each Flink partition in parallel, since I'm only interested in processing in order the events coming from the same Kafka partition, not in a global ordering. (b) How can I do that? I know the setParallelism() method exists, but I don't know where to apply it in this scenario.
I'm looking for an answer to the questions marked (a) and (b). Thank you in advance.
If you can build it like this, it will perform better.
Specifically, what I'm proposing is:
Set the parallelism of the entire job to exactly match the number of Kafka partitions. Then each FlinkKafkaConsumer instance will read from exactly one partition.
If possible, avoid using keyBy, and avoid changing the parallelism. Then the source, map, and sink will all be chained together (this is called operator chaining), and no serialization/deserialization and no networking will be needed (within Flink). Not only will this perform well, but you can also take advantage of fine-grained recovery (streaming jobs that are embarrassingly parallel can recover one failed task without interrupting the others).
You can write a general purpose EventMapper that checks to see what type of event is being processed, and then does whatever is appropriate. Or you can try to be clever and implement a RichMapFunction that in its open() figures out which partition is being handled, and loads the appropriate mapper.
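As a rough illustration of the first option, a single map function can dispatch on the event type itself, so the stream never has to be split with keyBy. Event, Result, the getType() discriminator, and the per-type mappers below are hypothetical stand-ins for the classes in the question:

import org.apache.flink.api.common.functions.MapFunction;

// One chain-friendly mapper instead of four keyBy branches: it inspects each event
// and delegates to the appropriate per-type logic, so source -> map -> sink stays chained.
public class GeneralEventMapper implements MapFunction<Event, Result> {

    private final AEventMapper aMapper = new AEventMapper();
    private final BEventMapper bMapper = new BEventMapper();
    private final CEventMapper cMapper = new CEventMapper();
    private final DEventMapper dMapper = new DEventMapper();

    @Override
    public Result map(Event event) throws Exception {
        switch (event.getType()) {           // hypothetical type discriminator
            case A:  return aMapper.map(event);
            case B:  return bMapper.map(event);
            case C:  return cMapper.map(event);
            default: return dMapper.map(event);
        }
    }
}

It would then be wired up as stream.map(new GeneralEventMapper()).addSink(...), with the job parallelism set to match the number of Kafka partitions (4 in the question).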

Kafka Streams Processor API clear state store

I am using the Kafka Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder =
        Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), storeObjectSerde);

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addProcessor("processor", () -> new CustomProcessor("storeName"), "SourceReadername")
        .addStateStore(storeBuilder, "processor") // define store for processor
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on an event coming in on a separate topic. I am not able to find the right way to join with another stream using the Processor API, or some other way to listen to events on another topic, so that I can trigger the cleanup code in the CustomProcessor class.
Is there a way to get events from another topic in the Processor API? Or perhaps mix the DSL with the Processor API to join the two, so that events from either topic reach the process() method and I can run the cleanup code when an event arrives on the cleanup topic?
Thanks
You just need to add another input topic (addSource) and add a Processor that handles messages from that topic and, based on them, removes entries from the state store. One note: both topics should use the same keys (because of partitioning).
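In terms of the topology from the question, that could look roughly like the sketch below. The topic, source, and processor names are made up, and CleanupProcessor is a hypothetical processor whose process() method calls delete() on the key-value store for the keys it receives:

topology.addSource("CleanupReader", stringDeserializer, cleanupSerde.deserializer(), "cleanupTopic")
        .addProcessor("cleanupProcessor", () -> new CleanupProcessor("storeName"), "CleanupReader")
        // give the cleanup processor access to the same store the main processor uses
        .connectProcessorAndStateStores("cleanupProcessor", "storeName");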

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of Kafka Producers/Consumers, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.
I would say the Kafka Streams API is less complex but also less flexible compared to plain Consumers/Producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all of the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
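To make that difference concrete, here is roughly what the two styles look like side by side. This is only a sketch: the topic names, the filter logic, and the consumerProps/streamsProps Properties objects are invented for illustration.

// Plain Consumer API: you own the subscribe/poll loop, the error handling,
// producing the results, and deciding when to commit offsets.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Collections.singletonList("orders"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        // business logic + produce results + offset management ...
    }
}

// Kafka Streams DSL: consuming, producing, and offset handling are abstracted away;
// you only declare the processing logic.
StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("orders")
       .filter((key, value) -> value.contains("priority"))
       .to("priority-orders");
new KafkaStreams(builder.build(), streamsProps).start();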
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on the city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages would be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API as well, but it would require much more coding.
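A minimal sketch of that filter-and-forward step (the Order type, getDeliveryCity(), the city, the orderSerde, and the topic names are all invented for illustration):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders", Consumed.with(Serdes.String(), orderSerde));

// Keep only orders for the city we care about and write them to a downstream topic;
// Kafka Connect sink connectors then copy that topic into the DB table and the Elasticsearch index.
orders.filter((orderId, order) -> "Berlin".equals(order.getDeliveryCity()))
      .to("berlin-orders", Produced.with(Serdes.String(), orderSerde));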
In a data processing pipeline, you can do the consume-process-produce cycle in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders per individual product. In such scenarios, duplicates will always give you a wrong result.
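Enabling this in Kafka Streams is a configuration switch rather than extra code; a sketch, assuming a recent Kafka version (older versions use StreamsConfig.EXACTLY_ONCE instead):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-pipeline");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// consume-process-produce runs in one transaction; duplicates from producer retries are not visible downstream
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);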
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and since the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
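A minimal sketch of that lookup, assuming the orders stream is keyed by customer id, a compacted customers topic holds a Customer record with the email address, and default Serdes are configured (the types and topic names are invented):

KTable<String, Customer> customers = builder.table("customers");        // compacted topic, keyed by customerId
KStream<String, Order> ordersByCustomer = builder.stream("orders");     // assumed keyed by customerId as well

// The join is a local lookup in the RocksDB-backed KTable state, not a remote call per order.
KStream<String, EnrichedOrder> enriched = ordersByCustomer.join(
        customers,
        (order, customer) -> new EnrichedOrder(order, customer.getEmail()));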
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
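That one-hour windowed join might look like this in the DSL (Order, Payment, and PaidOrder are placeholder types, both streams are assumed to be keyed by order id, and default Serdes are assumed):

KStream<String, Order> orders = builder.stream("orders");
KStream<String, Payment> payments = builder.stream("payments");

// Orders and payments arriving within one hour of each other (in either order) are paired up;
// the buffered records live in local RocksDB-backed window stores until the window closes.
KStream<String, PaidOrder> paidOrders = orders.join(
        payments,
        (order, payment) -> new PaidOrder(order, payment),
        JoinWindows.of(Duration.ofHours(1)));

paidOrders.to("paid-orders");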

Can Kafka streams deal with joining streams efficiently?

I'm new to Kafka and I'd like to know if what I'm planning is possible and reasonable to implement.
Suppose we have two sources, s1 and s2 that emit some messages to topics t1 and t2 respectively. Now, I'd like to have a sink which listens to both topics and I'd like it to process tuples of messages <m1, m2> where m1.key == m2.key.
If m1.key was never found in some message of s2, then the sink completely ignores m1.key (will never process it).
In summary, the sink will work only on keys that s1 and s2 worked on.
A traditional and maybe naive solution would be to have some sort of cache or storage and to work on an item only when both of the messages are in the cache.
I'd like to know if Kafka offers a solution to this problem.
Most modern stream processing engines, such as Apache Flink, Kafka Streams, or Spark Streaming, can solve this problem for you. All three have battle-tested Kafka consumers built for use cases like this.
Even within those frameworks, there are multiple different ways to achieve a streaming join like the above.
In Flink for example, one could use the Table API which has a SQL-like syntax.
What I have used in the past looks a bit like the example in this SO answer (you can just replace fromElements with a Kafka Source).
One thing to keep in mind when working with streams is that you do NOT have any ordering guarantees when consuming data from two Kafka topics t1 and t2. Your code needs to account for messages arriving in any order.
Edit - Just realised your question was probably about how you can implement the join using Kafka Streams, as opposed to a stream of data from Kafka. In this case you will probably find relevant info here.
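If you do go the Kafka Streams route, the behaviour you describe (only act on keys that both s1 and s2 have produced) is exactly what an inner join gives you. A rough sketch, with the M1/M2 value types, the combine() call, and the Serde configuration all assumed; note that t1 and t2 must be co-partitioned (same key and partition count):

StreamsBuilder builder = new StreamsBuilder();
KTable<String, M1> t1 = builder.table("t1");   // latest m1 per key
KTable<String, M2> t2 = builder.table("t2");   // latest m2 per key

// Inner join: a <m1, m2> pair is emitted only for keys present in BOTH tables,
// so a key that never shows up in t2 is simply never processed downstream.
t1.join(t2, (m1, m2) -> combine(m1, m2))
  .toStream()
  .to("joined-output");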

We read data from brokers through multiple consumers in a consumer group, but how is the consumed data combined?

I need data from Kafka brokers, but for fast access I am using multiple consumers with the same group id, known as a consumer group. But after reading by each consumer, how can we combine data from multiple consumers? Is there any logic?
By design, different consumers in the same consumer group process data independently from each other. (This behavior is what allows applications to scale well.)
But after reading by each consumer, how can we combine data from multiple consumers? Is there any logic?
The short but slightly simplified answer, when you use Kafka's "Consumer API" (also called the "consumer client" library), which I think is what you are using based on the wording of your question: if you need to combine data from multiple consumers, the easiest option is to make this (new) input data available in another Kafka topic, where you do the combining in a subsequent processing step. A trivial example: the second Kafka topic would be set up to have just 1 partition, so any subsequent processing step would see all the data that needs to be combined.
If this sounds a bit too complicated, I'd suggest using Kafka's Streams API, which makes it much easier to define such processing flows (e.g. joins or aggregations, like in your question). In other words, Kafka Streams gives you a lot of the built-in "logic" that you are looking for: https://kafka.apache.org/documentation/streams/
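For example, a per-key aggregation that would require the single-partition trick (or extra coordination) with plain consumers is just a couple of operators in Kafka Streams; a sketch with invented topic names, assuming String keys and values with default Serdes:

StreamsBuilder builder = new StreamsBuilder();
KTable<String, Long> countsPerKey = builder.<String, String>stream("input-topic")
        .groupByKey()   // group records by key across the whole application
        .count();       // per-key count, maintained in a fault-tolerant local state store

countsPerKey.toStream().to("counts-topic", Produced.with(Serdes.String(), Serdes.Long()));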
The aim of Kafka is to provide you with a scalable, performant, and fault-tolerant framework. Having a group of consumers reading the data from different partitions asynchronously allows you to achieve the first two goals. Grouping the data is a bit outside the scope of the standard Kafka flow - you could use a single partition with a single consumer in the simplest case, but I'm sure that is not what you want.
For things like aggregating a single state from different consumers, I would recommend applying a solution designed specifically for that sort of goal. If you are working in the Hadoop ecosystem, you can use a Storm Trident bolt, which allows you to aggregate the data from your Kafka spouts. Or you can use Spark Streaming, which lets you do the same but in a slightly different fashion. Or, as an option, you can always implement your own custom component with such logic using the standard Kafka libraries.