Kafka Streams and Spring Cloud Stream - Processor Efficiency

I would like to confirm my understanding of the efficiency of having multiple processors reading from one Kafka Streams source. I believe Example 1 below is the most efficient approach if I want two different processes performed depending on predicate logic. The predicates look at the content of the value (the Notification object here). If you put a breakpoint in each of the processors in Example 1, you can see that each Function is called for every incoming Notification, whereas in Example 2 the process2 Function is only called if the predicate logic is met.
Example 1
@Bean
public Function<KStream<String, Notification>, KStream<String, Notification>[]> process1() {
    return input -> input
            .branch(PREDICATE_FOR_OUT_0, PREDICATE_FOR_OUT_1);
}
@Bean
public Function<KStream<String, Notification>, KStream<String, EnrichedNotification>> process2() {
    return input -> input
            .filter(PREDICATE_FOR_OUT_2)
            .map((key, value) -> ...); // different additional processing to map to EnrichedNotification type
}
So there is no need for the following, where I attempt to route the output of one processor into another? (I'm not sure it is even possible.)
Example 2 (conceptual)
I am probably thinking this way because I am coming from using pure Kafka. Here, process1 has a 3-way branch. Two of the branches go to their respective streams and then topics, but the third requires further processing before it can be routed to a topic.
@Bean
public Function<KStream<String, Notification>, KStream<String, Notification>[]> process1() {
    return input -> input
            .branch(PREDICATE_FOR_OUT_0, PREDICATE_FOR_OUT_1, PREDICATE_FOR_OUT_2);
}
Could we potentially route the branch for PREDICATE_FOR_OUT_2 into process2? This would mean process2 would only be called if PREDICATE_FOR_OUT_2 was met.
@Bean
public Function<KStream<String, Notification>, KStream<String, EnrichedNotification>> process2() {
    return input -> input
            .map((key, value) -> ...); // different additional processing to map to EnrichedNotification type
}
My thinking is that Example 2 is redundant (and not actually possible anyway) because of the abstraction and functionality that Kafka Streams provides.

I think both of your examples can get the job done, but there are some differences. In the first example, you have two functions, both receiving data from the same Kafka topic; the second function performs some additional logic before the data is routed to the output topic. In the second example, you again have two functions. In the first function you have three branches, each of them sending data to a Kafka topic (I assume they are three different topics). Then in the second function you receive data from the third output topic of the first function. After performing the logic in that second function of example 2, you send the result to the final destination for this branch. You are introducing an extra topic with this second approach. I think your first example is more readable and clean.
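For reference, here is a rough sketch of the bindings that could back your first example, assuming the Spring Cloud Stream 3.x functional binding conventions; all topic names below are made up for illustration:
spring.cloud.stream.function.definition=process1;process2
# process1 reads the input topic and branches to two outputs
spring.cloud.stream.bindings.process1-in-0.destination=notifications
spring.cloud.stream.bindings.process1-out-0.destination=notifications-out-0
spring.cloud.stream.bindings.process1-out-1.destination=notifications-out-1
# process2 reads the same input topic and writes the enriched records
spring.cloud.stream.bindings.process2-in-0.destination=notifications
spring.cloud.stream.bindings.process2-out-0.destination=enriched-notifications
Both functions consume the same destination independently, which is why every incoming Notification reaches both of them.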

Related

Kafka - Different KafkaProducers in same transaction

For migration purposes, we need to produce records using two different serializers, and thus two different KafkaProducers (one String and one Avro), in the same transaction.
But all the transaction handling is done through one KafkaProducer instance, as follows:
kafkaProducer.beginTransaction();
...
kafkaProducer.send(record);
...
kafkaProducer.commitTransaction();
Can I use a second kafkaProducer (with the second serializer), use the same transactional.id, and do it like this:
kafkaProducer.beginTransaction();
...
kafkaProducer.send(record);
kafkaProducer2.send(record);
...
kafkaProducer.commitTransaction();
Will everything be part of the same transaction and remain consistent?
EDIT 1:
According to what I saw in the Java implementation, there is some mechanism when calling commitTransaction(), such as calling flush() on the producer itself, so I think the model above won't work.
Is there any chance of achieving this without instantiating a full new instance of everything in parallel?
You can only have a single producer active in a transaction at a time.
If you start two producers with the same transactional.id, one of them will be fenced and won't be able to commit its records, so not all records will be part of the same transaction.
You need to use a single producer, and one possible workaround is to configure it to use the BytesSerializer and handle the conversion of your objects to bytes explicitly in your own logic.
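For illustration, here is a minimal sketch of that workaround: one transactional producer configured with the BytesSerializer, with the objects turned into bytes by hand. The topic names, the transactional.id and the serializeWithAvro() helper are placeholders, not something from your setup:
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.BytesSerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.utils.Bytes;

public class MixedSerializationTxSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "migration-tx-id"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, BytesSerializer.class.getName());

        try (KafkaProducer<String, Bytes> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // "String" record: convert the value to bytes explicitly
                byte[] stringPayload = "some value".getBytes(StandardCharsets.UTF_8);
                producer.send(new ProducerRecord<>("string-topic", "key1", Bytes.wrap(stringPayload)));

                // "Avro" record: run your Avro serializer by hand and send the raw bytes
                byte[] avroPayload = serializeWithAvro();
                producer.send(new ProducerRecord<>("avro-topic", "key2", Bytes.wrap(avroPayload)));

                producer.commitTransaction();
            } catch (RuntimeException e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }

    // Hypothetical stand-in for invoking an Avro serializer manually.
    private static byte[] serializeWithAvro() {
        return new byte[0];
    }
}
Because a single producer instance does all the sending, everything still goes through one transaction.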

Multiple StreamListeners with Spring Cloud Stream connected to Kafka

In a Spring Boot app using Spring Cloud Stream connecting to Kafka, I'm trying to set up two separate stream listener methods:
One reads from topics "t1" and "t2" as KTables, re-partitioning using a different key in one, then joining to data from the other
The other reads from an unrelated topic, "t3", as a KStream.
Because the first listener does some joining and aggregating, some topics are created automatically, e.g. "test-1-KTABLE-AGGREGATE-STATE-STORE-0000000007-repartition-0". (Not sure if this is related to the problem or not.)
When I set up the code by having two separate methods annotated with @StreamListener, I get the error below when the Spring Boot app starts:
Exception in thread "test-d44cb424-7575-4f5f-b148-afad034c93f4-StreamThread-2" java.lang.IllegalArgumentException: Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignFromSubscribed(SubscriptionState.java:195)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:225)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:367)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:295)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1146)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1111)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:848)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:805)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:771)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:741)
I think the important part is: "Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3". These are the two unrelated topics, so as far as I can see nothing related to t3 should be subscribing to anything related to t1. The exact topic which causes the problem also changes intermittently: sometimes it's one of the automatically generated topics which is mentioned, rather than t1 itself.
Here is how the two stream listeners are set up (in Kotlin):
@StreamListener
fun listenerForT1AndT2(
        @Input("t1") t1KTable: KTable<String, T1Obj>,
        @Input("t2") t2KTable: KTable<String, T2Obj>) {
    t2KTable
        .groupBy(...)
        .aggregate(
            { ... },
            { ... },
            { ... },
            Materialized.with(Serdes.String(), JsonSerde(SomeObj::class.java)))
        .join(t1KTable,
            { ... },
            Materialized.`as`<String, SomeObj, KeyValueStore<Bytes, ByteArray>>("test")
                .withKeySerde(Serdes.String())
                .withValueSerde(JsonSerde(SomeObj::class.java)))
}
@StreamListener
fun listenerForT3(@Input("t3") t3KStream: KStream<String, T3Obj>) {
    t3KStream.map { ... }
}
However, when I set up my code with just one method annotated with @StreamListener that takes parameters for all three topics, everything works fine, e.g.
@StreamListener
fun compositeListener(
        @Input("t1") t1KTable: KTable<String, T1Obj>,
        @Input("t2") t2KTable: KTable<String, T2Obj>,
        @Input("t3") t3KStream: KStream<String, T3Obj>) {
    ...
}
But I don't think it's right that I can only have one @StreamListener method.
I know that there is content-based routing for adding conditions to the StreamListener annotation, but if the methods define the input channels then I'm not sure whether I need to use it here - I'd have thought the use of the @Input annotations on the method parameters would be enough to tell the system which channels (and therefore which Kafka topics) to bind to. If I do need to use content-based routing, how can I apply it here so that each method receives only the items from the relevant topic(s)?
I've also tried separating out the two listener methods into two separate classes, each of which has @EnableBinding for only the interface it's interested in (i.e. one interface for t1 and t2, and a separate interface for t3), but that doesn't help.
Everything else I've found related to this error message, e.g. here, is about having multiple app instances, but in my case there's only one Spring Boot app instance.
You need a separate application id for each StreamListener method. Here is an example:
spring.cloud.stream.kafka.streams.bindings.t1.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t2.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t3.consumer.application-id=processor2-application-id
You probably want to test with the latest snapshot (2.1.0) as there were some recent changes with the way application id is processed by the binder.
Please see the update here for more details.
Here is a working sample of multiple StreamListener methods which are Kafka Streams processors.

Set timestamp in output with Kafka Streams

I'm getting CSVs in a Kafka topic "raw-data"; the goal is to transform them by sending each line to another topic "data" with the right timestamp (different for each line).
Currently, I have two streams:
one to split the lines in "raw-data", sending them to an "internal" topic (no timestamp)
one with a TimestampExtractor that consumes "internal" and sends them to "data".
I'd like to remove the use of this "internal" topic by setting the timestamp directly, but I couldn't find a way (timestamp extractors are only used at consumption time).
I've stumbled upon this line in the documentation:
Note that the described default behavior can be changed in the Processor API by assigning timestamps to output records explicitly when calling #forward().
but I couldn't find any signature with a timestamp. What do they mean?
How would you do it?
Edit:
To be clear, I have a Kafka topic with one message containing the event time and some value, such as:
2018-01-01,hello
2018-01-02,world
(this is ONE message, not two)
I'd like to get two messages in another topic with the Kafka record timestamp set to their event time (2018-01-01 and 2018-01-02) without the need of an intermediate topic.
Setting the timestamp for the output requires Kafka Streams 2.0 and is only supported in the Processor API. If you use the DSL, you can use transform() to access those APIs.
As you pointed out, you would use context.forward(). The call would be:
stream.transform(new TransformerSupplier() {
    public Transformer get() {
        return new Transformer() {
            // other methods (init(), close(), ...) omitted for brevity
            // you need to get the `context` from `init()`
            public KeyValue transform(K key, V value) {
                // some business logic

                // you can call #forward() as often as you want
                context.forward(newKey, newValue, To.all().withTimestamp(newTimestamp));
                return null; // only return data via context#forward()
            }
        };
    }
});

Kafka Stream Topology on multiple instances

We have a streams topology that will work on multiple machines. We are storing time-windowed aggregation results into state stores.
Since state stores hold local data, I think the aggregation should be done on another topic to get an overall aggregation.
But it seems like I am missing something, because none of the examples does the overall aggregation on another KStream or Processor.
Do we need to use the groupBy logic for storing the overall aggregation, use a GlobalKTable, or just implement our own merger code somewhere?
What is the correct architecture for this?
In the code below, I have tried to group all the messages coming to the processor under a constant key, so that the overall aggregation is stored on just one machine, but I think this loses the parallelism that Kafka supplies.
dashboardItemProcessor = streamsBuilder.stream("Topic25", Consumed.with(Serdes.String(), eventSerde))
        .filter((key, event) -> event != null && event.getClientCreationDate() != null);

dashboardItemProcessor.map((key, event) -> KeyValue.pair(key, event.getClientCreationDate().toInstant().toEpochMilli()))
        .groupBy((key, event) -> "count", Serialized.with(Serdes.String(), Serdes.Long()))
        .windowedBy(timeWindow)
        .count(Materialized.as(dashboardItemUtil.getStoreName(itemId, timeWindow)));
In the code below, I have tried to group all the messages coming to the processor under a constant key, so that the overall aggregation is stored on just one machine, but I think this loses the parallelism that Kafka supplies.
This seems to be the right approach. And yes, you lose parallelism, but that is how a global aggregation works. In the end, one machine must compute it...
What you could improve, though, is to use a two-step approach: i.e., first aggregate by "random" keys in parallel, and then use a second step with only one key to "merge" the partial aggregates into a single one. This way, some parts of the computation are parallelized and only the final step (on a hopefully reduced data volume) is non-parallel. Using Kafka Streams, you need to implement this approach "manually".
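To make that concrete, here is a minimal sketch of the two-step approach in the same DSL style as the code in the question; it uses a plain (non-windowed) count and String keys/values for brevity, and the topic, store and bucket names are assumptions:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Serialized;

StreamsBuilder builder = new StreamsBuilder();
int buckets = 8; // assumed: any small, fixed number of partial aggregates

// Step 1: count per "random" bucket key, so the partial counts are computed in parallel.
KTable<String, Long> partialCounts = builder.<String, String>stream("Topic25")
        .groupBy((key, value) -> "bucket-" + Math.abs(key.hashCode() % buckets),
                 Serialized.with(Serdes.String(), Serdes.String()))
        .count(Materialized.as("partial-counts"));

// Step 2: re-key every partial count to one constant key and merge the partials;
// only this final, much smaller step runs without parallelism.
partialCounts
        .groupBy((bucket, count) -> KeyValue.pair("total", count),
                 Serialized.with(Serdes.String(), Serdes.Long()))
        .reduce(Long::sum,                                     // adder: a bucket's count was updated
                (aggregate, oldCount) -> aggregate - oldCount, // subtractor: retract its previous value
                Materialized.as("total-count"))
        .toStream()
        .to("overall-counts", Produced.with(Serdes.String(), Serdes.Long()));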

Max number of tuple replays on Storm Kafka Spout

We’re using Storm with the Kafka Spout. When we fail messages, we’d like to replay them, but in some cases bad data or code errors will cause messages to always fail a Bolt, so we’ll get into an infinite replay cycle. Obviously we’re fixing errors when we find them, but would like our topology to be generally fault tolerant. How can we ack() a tuple after it’s been replayed more than N times?
Looking through the code for the Kafka Spout, I see that it was designed to retry with an exponential backoff timer and the comments on the PR state:
"The spout does not terminate the retry cycle (it is my conviction that it should not do so, because it cannot report context about the failure that happened to abort the reqeust), it only handles delaying the retries. A bolt in the topology is still expected to eventually call ack() instead of fail() to stop the cycle."
I've seen StackOverflow responses that recommend writing a custom spout, but I'd rather not be stuck maintaining a custom patch of the internals of the Kafka Spout if there's a recommended way to do this in a Bolt.
What’s the right way to do this in a Bolt? I don’t see any state in the tuple that exposes how many times it’s been replayed.
Storm itself does not provide any support for your problem, so a customized solution is the only way to go. Even if you do not want to patch KafkaSpout, I think introducing a counter into it and breaking the replay cycle there would be the best approach. As an alternative, you could also inherit from KafkaSpout and put a counter in your subclass. This is of course somewhat similar to a patch, but it might be less intrusive and easier to implement.
If you want to use a Bolt, you could do the following (which also requires some changes to the KafkaSpout or a subclass of it).
Assign a unique ID as an additional attribute to each tuple (maybe there is already a unique ID available; otherwise, you could introduce a "counter-ID", or just use the whole tuple, i.e. all attributes, to identify each tuple).
Insert a bolt after KafkaSpout via fieldsGrouping on the ID (to ensure that a tuple that is replayed is streamed to the same bolt instance).
Within your bolt, use a HashMap<ID,Counter> that buffers all tuples and counts the number of (re-)tries. If the counter is smaller than your threshold value, forward the input tuple so it gets processed by the actual topology that follows (of course, you need to anchor the tuple appropriately). If the count is larger than your threshold, ack the tuple to break the cycle and remove its entry from the HashMap (you might also want to LOG all failed tuples).
In order to remove successfully processed tuples from the HashMap, each time a tuple is acked in KafkaSpout you need to forward the tuple ID to the bolt so that it can remove the tuple from the HashMap. Just declare a second output stream for your KafkaSpout subclass and override Spout.ack(...) (of course you need to call super.ack(...) to ensure KafkaSpout gets the ack, too).
This approach might consume a lot of memory though. As an alternative to having an entry for each tuple in the HashMap, you could also use a third stream (connected to the bolt like the other two) and forward a tuple ID if a tuple fails (i.e. in Spout.fail(...)). Each time the bolt receives a "fail" message on this third stream, the counter is increased. As long as no entry is in the HashMap (or the threshold is not reached), the bolt simply forwards the tuple for processing. This should reduce the memory used but requires some more logic to be implemented in your spout and bolt.
Both approaches have the disadvantage that each acked tuple results in an additional message to your newly introduced bolt (thus increasing network traffic). For the second approach, it might seem that you only need to send an "ack" message to the bolt for tuples that failed before. However, you do not know which tuples failed and which did not. If you want to get rid of this network overhead, you could introduce a second HashMap in KafkaSpout that buffers the IDs of failed messages. Then you only send an "ack" message if a failed tuple was replayed successfully. Of course, this third approach makes the logic to be implemented even more complex.
Without modifying KafkaSpout to some extent, I see no solution for your problem. I personally would patch KafkaSpout, or would use the third approach with a HashMap in the KafkaSpout subclass and the bolt (because it consumes little memory and does not put a lot of additional load on the network compared to the first two solutions).
Basically it works like this:
If you deploy topologies, they should be production grade (that is, a certain level of quality is expected, and the number of failing tuples should be low).
If a tuple fails, check if the tuple is actually valid.
If a tuple is valid (for example, it failed to be inserted because it's not possible to connect to an external database, or something like this), replay it.
If a tuple is malformed and can never be handled (for example, a database id which is text while the database is expecting an integer), it should be acked; you will never be able to fix such a thing or insert it into the database.
New kinds of exceptions should be logged (as well as the tuple contents themselves). You should check these logs, generate rules to validate tuples in the future, and eventually add code to correctly process them (ETL).
Don't log everything, otherwise your log files will be huge; be very selective about what you log. The contents of the log files should be useful, not a pile of rubbish.
Keep doing this, and eventually you will cover all cases.
We also faced a similar situation, where bad data coming in caused a bolt to fail infinitely.
In order to resolve this at runtime, we introduced one more bolt, calling it "DebugBolt" for reference. The spout sends the message to this bolt first, this bolt does the required data fixes for the bad messages, and it then emits them to the required bolt. This way one can fix data errors on the fly.
Also, if you need to delete some messages, you can pass an ignoreFlag from your DebugBolt to your original bolt, and your original bolt should just send an ack to the spout without processing if the ignoreFlag is true.
We simply had our bolt emit the bad tuple on an error stream and ack it. Another bolt handled the error by writing it back to a Kafka topic specifically for errors. This allows us to easily direct normal vs. error data flow through the topology.
The only case where we fail a tuple is because some required resource is offline, such as a network connection, DB, ... These are retriable errors. Anything else is directed to the error stream to be fixed or handled as is appropriate.
This all assumes of course, that you don't want to incur any data loss. If you only want to attempt a best effort and ignore after a few retries, then I would look at other options.
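For illustration, a bolt following that pattern could look roughly like this; the stream id, the field names and the isRetriable() check are assumptions, and the bolt that writes the error stream back to a Kafka error topic is left out:
import java.io.IOException;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ErrorRoutingBolt extends BaseRichBolt {

    static final String ERROR_STREAM = "errors"; // assumed stream id

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // normal processing, anchored on the input tuple
            collector.emit(input, new Values(process(input)));
            collector.ack(input);
        } catch (Exception e) {
            if (isRetriable(e)) {
                // a required resource (network, DB, ...) is offline: fail so the tuple is replayed
                collector.fail(input);
            } else {
                // bad data: route it to the error stream and ack it so it is never replayed
                collector.emit(ERROR_STREAM, input, new Values(input.getValue(0), e.getMessage()));
                collector.ack(input);
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("result"));
        declarer.declareStream(ERROR_STREAM, new Fields("payload", "error"));
    }

    // placeholder for the real business logic
    private Object process(Tuple input) {
        return input.getValue(0);
    }

    // hypothetical check for "resource offline" style errors
    private boolean isRetriable(Exception e) {
        return e instanceof IOException;
    }
}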
As far as I know, Storm doesn't provide built-in support for this.
I have used the implementation below:
public class AuditMessageWriter extends BaseBolt {

    private static final long serialVersionUID = 1L;

    // assumed replay threshold; tune it to your message reprocess limit
    private static final int MESSAGE_REPROCESS_LIMIT = 3;

    Map<Object, Integer> failedTuple = new HashMap<>();

    public AuditMessageWriter() {
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // any initialization if you want
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void execute(Tuple input) {
        try {
            // Write your processing logic here
            collector.ack(input);
        } catch (Exception e2) {
            // On failure, track the tuple in the failedTuple map and fail it so it is replayed;
            // once the failure count reaches the limit (message reprocess limit),
            // log it, remove it from the map and acknowledge the tuple to break the cycle.
            Object id = input.getValue(0); // assumes the first field uniquely identifies the tuple
            int count = failedTuple.getOrDefault(id, 0) + 1;
            if (count < MESSAGE_REPROCESS_LIMIT) {
                failedTuple.put(id, count);
                collector.fail(input);
            } else {
                log(input);
                failedTuple.remove(id);
                collector.ack(input);
            }
            ExceptionHandler.LogError(e2, "Message IO Exception");
        }
    }

    void log(Tuple input) {
        try {
            // Here you can send the tuple to a dead-letter queue or log it
        } catch (Exception e) {
            ExceptionHandler.LogError(e, "Exception while logging");
        }
    }

    @Override
    public void cleanup() {
        // Nothing to clean up for this bolt.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output fields to declare for this bolt.
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}