I am currently reading data from a single kafka topic and writing to dynamic topic based on data itself. I have implemented the following code (which is dynamically selecting topic based on accountId of data) and it working just fine :
class KeyedEnrichableEventSerializationSchema(schemaRegistryUrl: String)
extends KafkaSerializationSchema[KeyedEnrichableEvent]
with KafkaContextAware[KeyedEnrichableEvent] {
private val enrichableEventClass = classOf[EnrichableEvent]
private val enrichableEventSerialization: AvroSerializationSchema[EnrichableEvent] =
ConfluentRegistryAvroSerializationSchema.forSpecific(enrichableEventClass, enrichableEventClass.getCanonicalName, schemaRegistryUrl)
override def serialize(element: KeyedEnrichableEvent, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
new ProducerRecord("trackingevents."+element.value.getEventMetadata.getAccountId, element.key, enrichableEventSerialization.serialize(element.value))
override def getTargetTopic(element: KeyedEnrichableEvent): String = "trackingevents."+element.value.getEventMetadata.getAccountId
The problem/concern is, if the topic does not exist, i am having an exception in JobManager UI about the topic not present and whole processing is halted until I create the topic. May be this is recommended behaviour but is there any alternate like maybe we can put the data in different topic and resume processing instead of stopping whole processing. Or at least is there any notification mechanism available in flink which notifies immediately that the processing is halted.
I think that the simplest solution in this case is to enable autocreation of topics in Kafka, so that problem is solved totally.
However, if for some reason it's impossible to be done in Your case, the simplest solution IMHO would be to create a ProcessFunction, which would keep a connection to Kafka using KafkaConsumer or AdminClient and periodically check if the topic that would be used for given message exists. You could then whether push the data to sideOutput or try to create topics that are missing.
Related
I'm aware that you can define stream-processing Kafka application in the form of a topology that implicitly understands which record has gone through successfully, and therefore can correctly commit the consumer offset so that when the microservice has to be restarted, it will continue reading the input toppic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this.
KStream<String, Prediction> stream = new StreamsBuilder()
.stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
// talk to web service
.map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
.flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
.to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assuming that the external service can issue zero, one or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you would might have a problem if you are using map. combining remote calls in a DSL operator is not recommended. You might want to look into using the Processor API docs. With ProcessorContext you can forward or commit which could give you flexibility you need.
I am using kafka Processor API to do some custom calculations. Because of some complex processing, DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder = Stores.keyValueStoreBuilder(storeSupplier,
Serdes.String(), storeObjectSerde);
topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
.addProcessor("processor", () -> new CustomProcessor("store"), FillReadername)
.addStateStore(storeBuilder, "processor") // define store for processor
.addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(),
Fill_PROCESSOR);
I need to clear some items from the state store based on an event coming in a separate topic. I am not able to find the right way to probably join with another stream using Processor API or some other way to listen to events in another topic to be able to trigger the cleanup code in the CustomProcessor class.
Is there a way we can get events in another topic in Processor API? Or probably mix DSL with Processor API to be able to join the two and send events in any of the topic to the Process method so that I can run the cleanup code when an event is received in the cleanup topic?
Thanks
You just need to add another input topic (add:Source) and add Processor that transforms messages from that topic and based on them remove staff from state store. One note, those topics should use same keys (because of partitioning).
Is there any way to make my Kafka Stream application automatically read from the newly created topic?
Even if the topic is created while the stream application is already running?
Something like having a wildcard in topic name like this:
KStream<String, String> rawText = builder.stream("topic-input-*");
Why I need this?
Right now, I've multiple clients sending data(all with the same schema) to their own topic and my stream application reads from those topics. Then my application does some transformation and writes the result to a single topic.
Although all of the clients could write to the same topic, an unbehaving client could also write on behalf of someone else. So I've created individual topics for each client. The problem is, whenever a new client comes, I create the new topic and set the ACL for them with a script but that is not enough. I also have to stop my streaming application, edit the code, add the new topic, compile it, package it, put it on the server and run it again!
Kafka Streams supports patter subscription:
builder.stream(Pattern.compile("topic-input-*"));
(I hope the syntax is right; not sure from the top of my head... But the point is, instead of passing in a String you can user an overload of the stream() method that takes a pattern.)
suppose my producer is writing the message to Topic A...once the message is in Topic A, i want to copy the same message to Topic B. Is this possible in kafka?
If I understand correctly, you just want stream.to("topic-b"), although, that seems strange without doing something to the data.
Note:
The specified topic should be manually created before it is used
I am not clear about what use case you are exactly trying to achieve by simply copying data from one topic to another topic. If both the topics are in the same Kafka cluster then it is never a good idea to have two topics with the same message/content.
I believe the gap here is that probably you are not clear about the concept of the Consumer group in Kafka. Probably you have two action items to do by consuming the message from the Kafka topic. And you are believing that if the first application consumes the message from the Kafka topic, will it be available for the second application to consume the same message or not. Kafka allows you to solve this kind of common use case with the help of the consumer group.
Let's try to differentiate between other message queue and Kafka and you will understand that you do not need to copy the same data/message between two topics.
In other message queues, like SQS(Simple Queue Service) where if the message is consumed by a consumer, the same message is not available to get consumed by other consumers. It is the responsibility of the consumer to delete the message safely once it has processed the message. By doing this we guarantee that the same message should not get processed by two consumers leading to inconsistency.
But, In Kafka, it is totally fine to have multiple sets of consumers consuming from the same topic. The set of consumers form a group commonly termed as the consumer group. Here one of the consumers from the consumer group can process the message based on the partition of the Kafka topic the message is getting consumed from.
Now the catch here is that we can have multiple consumer groups consuming from the same Kafka topic. Each consumer group will process the message in the way they want to do. There is no interference between consumers of two different consumer groups.
To fulfill your use case I believe you might need two consumer groups that can simply process the message in the way they want. You do not essentially have to copy the data between two topics.
Hope this helps.
There are two immediate options to forward the contents of one topic to another:
by using the stream feature of Kafka to create a forwarding link
between the two topics.
by creating a consumer / producer pair
and using those to receive and then forward on messages
I have a short piece of code that shows both (in Scala):
def topologyPlan(): StreamsBuilder = {
val builder = new StreamsBuilder
val inputTopic: KStream[String, String] = builder.stream[String, String]("topic2")
inputTopic.to("topic3")
builder
}
def run() = {
val kafkaStreams = createStreams(topologyPlan())
kafkaStreams.start()
val kafkaConsumer = createConsumer()
val kafkaProducer = createProducer()
kafkaConsumer.subscribe(List("topic1").asJava)
while (true) {
val record = kafkaConsumer.poll(Duration.ofSeconds(5)).asScala
for (data <- record.iterator) {
kafkaProducer.send(new ProducerRecord[String, String]("topic2", data.value()))
}
}
}
Looking at the run method, the first two lines set up a streams object to that uses the topologyPlan() to listen for messages in 'topic2' and forward then to 'topic3'.
The remaining lines show how a consumer can listen to a 'topic1' and use a producer to send them onward to 'topic2'.
The final point of the example here is Kafka is flexible enough to let you mix options depending on what you need, so the code above will take messages in 'topic1', and send them to 'topic3' via 'topic2'.
If you want to see the code that sets up consumer, producer and streams, see the full class here.
I am building a pretty straightforward KafkaStreams demo application, to test a use case.
I am not able to upgrade the Kafka broker I am using (which is currently on version 0.10.0), and there are several messages written by a pre-0.10.0 Producer, so I am using a custom TimestampExtractor, which I add as a default to the config in the beginning of my main class:
config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, GenericRecordTimestampExtractor.class);
When consuming from my source topic, this works perfectly fine. But when using an aggregation operator, I run into an exception because the FailOnInvalidTimestamp implementation of TimestampExtractor is used instead of the custom implementation when consuming from the internal aggregation topic.
The code of the Streams app looks something like this:
...
KStream<String, MyValueClass> clickStream = streamsBuilder
.stream("mytopic", Consumed.with(Serdes.String(), valueClassSerde));
KTable<Windowed<Long>, Long> clicksByCustomerId = clickStream
.map(((key, value) -> new KeyValue<>(value.getId(), value)))
.groupByKey(Serialized.with(Serdes.Long(), valueClassSerde))
.windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
.count();
...
The Exception I'm encountering is the following:
Exception in thread "click-aggregator-b9d77f2e-0263-4fa3-bec4-e48d4d6602ab-StreamThread-1" org.apache.kafka.streams.errors.StreamsException:
Input record ConsumerRecord(topic = click-aggregator-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition, partition = 9, offset = 0, CreateTime = -1, serialized key size = 8, serialized value size = 652, headers = RecordHeaders(headers = [], isReadOnly = false), key = 11230, value = org.example.MyValueClass#2a3f2ea2) has invalid (negative) timestamp.
Possibly because a pre-0.10 producer client was used to write this record to Kafka without embedding a timestamp, or because the input topic was created before upgrading the Kafka cluster to 0.10+. Use a different TimestampExtractor to process this data.
Now the question is: Is there any way I can make Kafka Streams use the custom TimestampExtractor when reading from the internal aggregation topic (optimally while still using the Streams DSL)?
You cannot change the timestamp extractor (as of v1.0.0). This is not allowed for correctness reasons.
But I am really wondering, how a record with timestamp -1 is written into this topic in the first place. Kafka Streams uses the timestamp that was provided by your custom extractor when writing the record. Also note, that KafkaProducer does not allow to write records with negative timestamp.
Thus, the only explanation I can think of is that some other producer did write into the repartitioning topic -- and this is not allowed... Only Kafka Streams should write into the repartioning topic.
I guess, you will need to delete this topic and let Kafka Streams recreate it to get back into a clean state.
From the discussion/comment of the other answer:
You need 0.10+ format to work with Kafka Streams. If you upgrade your brokers and keep 0.9 format or older, Kafka Streams might not work as expected.
It is well known issue :-). I have the same problem with old clients in the projects which are still using older Kafka clients like 0.9 and also when communicating with some "not certified" .NET clients.
Therefore I wrote dedicated class:
public class MyTimestampExtractor implements TimestampExtractor {
private static final Logger LOG = LogManager.getLogger( MyTimestampExtractor.class );
#Override
public long extract ( ConsumerRecord<Object, Object> consumerRecord, long previousTimestamp ) {
final long timestamp = consumerRecord.timestamp();
if ( timestamp < 0 ) {
final String msg = consumerRecord.toString().trim();
LOG.warn( "Record has wrong Kafka timestamp: {}. It will be patched with local timestamp. Details: {}", timestamp, msg );
return System.currentTimeMillis();
}
return timestamp;
}
}
When there are many messages you may skip logging, as it may flood.
After reading Matthias' answer I double checked everything and the cause of the issue were incompatible versions between the Kafka Broker and the Kafka Streams app. I was stupid enough to use Kafka Streams 1.0.0 with a 0.10.1.1 Broker, which is clearly stated as incompatible in the Kafka Wiki here.
Edit (thx to Matthias): The actual cause of the problem was the fact, that the log format used by our 0.10.1.x broker was still 0.9.0.x, which is incompatible with Kafka Streams.