I'm receiving CSVs in a Kafka topic "raw-data"; the goal is to transform them by sending each line to another topic, "data", with the right timestamp (different for each line).
Currently, I have two streams:
one that splits the lines in "raw-data" and sends them to an "internal" topic (without timestamps),
one with a TimestampExtractor that consumes "internal" and sends the records to "data".
I'd like to get rid of this "internal" topic by setting the timestamp directly, but I couldn't find a way (timestamp extractors are only used at consumption time).
I've stumbled upon this line in the documentation:
Note, that the described default behavior can be changed in the Processor API by assigning timestamps to output records explicitly when calling #forward().
but I couldn't find any signature with a timestamp. What do they mean?
How would you do it?
Edit:
To be clear, I have a Kafka topic in which a single message contains several lines, each with an event time and a value, such as:
2018-01-01,hello
2018-01-02,world
(this is ONE message, not two)
I'd like to get two messages in another topic, with the Kafka record timestamp of each set to its event time (2018-01-01 and 2018-01-02), without the need for an intermediate topic.
Setting the timestamp of output records requires Kafka Streams 2.0 and is only supported in the Processor API. If you use the DSL, you can use transform() to access those APIs.
As you pointed out, you would use context.forward(). The call would be:
stream.transform(new TransformerSupplier() {
    public Transformer get() {
        return new Transformer() {
            // other methods omitted for brevity
            // you need to get the `context` from `init()`
            public KeyValue transform(K key, V value) {
                // some business logic

                // you can call #forward() as often as you want
                context.forward(newKey, newValue, To.all().withTimestamp(newTimestamp));

                return null; // only return data via context#forward()
            }
        };
    }
});
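Putting that together for the CSV case in the question, here is a minimal sketch (the CsvSplitTransformer class name and the "yyyy-MM-dd,value" line format are assumptions for illustration, not part of the original answer):

import java.time.LocalDate;
import java.time.ZoneOffset;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

// Splits one multi-line CSV message and forwards each line with its own event-time timestamp.
public class CsvSplitTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(final ProcessorContext context) {
        this.context = context; // keep the context so transform() can call forward()
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String csv) {
        for (final String line : csv.split("\n")) {
            final String[] parts = line.split(",", 2); // e.g. "2018-01-01,hello"
            final long timestamp = LocalDate.parse(parts[0])
                    .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            context.forward(key, parts[1], To.all().withTimestamp(timestamp));
        }
        return null; // all output goes through context.forward()
    }

    @Override
    public void close() { }
}

The wiring would then read from "raw-data" and write straight to "data", so the "internal" topic is no longer needed (default String serdes assumed):

StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("raw-data")
       .transform(CsvSplitTransformer::new)
       .to("data");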
I am using Kafka Connect to send Elasticsearch data to Kafka.
Once the connector is running, a topic is automatically created whose name is the Elasticsearch index name plus a prefix.
Now I would like to split this topic into N topics based on a condition.
Every message in my output Kafka topic looks like this:
{"schema":
{"type":"struct",
"fields":[
{"type":"string","optional":true,"field":"nature"},
{"type":"string","optional":true,"field":"description"},
{"type":"string","optional":true,"field":"threshold"},
{"type":"string","optional":true,"field":"quality"},
{"type":"string","optional":true,"field":"rowid"},
{"type":"string","optional":true,"field":"avrotimestamp"},
{"type":"array","items":{"type":"string","optional":true},"optional":true,"field":"null"},
{"type":"string","optional":true,"field":"domain"},
{"type":"string","optional":true,"field":"name"},
{"type":"string","optional":true,"field":"avroversion"},
{"type":"string","optional":true,"field":"esindex"},
{"type":"string","optional":true,"field":"value"},
{"type":"string","optional":true,"field":"chrono"},
{"type":"string","optional":true,"field":"esid"},
{"type":"string","optional":true,"field":"ts"}],"optional":false,"name":"demofilter"},
"payload":
{
"nature":"R01",
"description":"Energy",
"threshold":"","quality":"192",
"rowid":"34380941",
"avrotimestamp":"2022-09-20T04:00:11.939Z",
"null":["TGT BQ 1B"],
"domain":"CFO",
"name":"RDC.R01.RED.MES",
"avroversion":"1",
"esindex":"demo_filter",
"value":"4468582",
"chrono":"133081200000000000",
"esid":"nuWIrYMBHyNMgyhJYscV",
"ts":"2022-09-20T02:00:00.000Z"
}
}
The description field can take several values, but it should contain one of these keywords: energy, electric, or temperature (examples: life energy, body temperature, car energy).
The goal is that when the description field contains the energy keyword, the record is sent to the energy topic, and so on, all in real time of course.
What I have looked at:
According to my research, Kafka Streams is an option; unfortunately, starting from the WordCount example I can't figure out how to do this. (I'm learning Kafka Streams for data processing.)
Using Python to sort the data after consuming it, but that takes time and defeats the real-time requirement.
What should I do?
Using Kafka Streams, you can make dynamic routing decisions in the to() call based on whatever is in the payload of an event. Here, the name of the output topic is derived from the event data.
myStream.to(
(eventId, event, record) -> "topic-prefix-" + event.methodOfYourEventLikeGetTypeName()
);
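Applied to this question, a minimal sketch could look as follows (the source topic name, the output topic names like "energy-topic", and the keyword search on the raw JSON string are assumptions for illustration; a real job would parse the payload and read the "description" field explicitly):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DescriptionRouter {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Source topic name is an assumption; use the topic created by your connector.
        final KStream<String, String> records = builder.stream("demo_filter");

        // Derive the output topic from the keyword found in the description.
        // For brevity this searches the raw JSON string.
        records.to((key, value, recordContext) -> {
            final String v = value.toLowerCase();
            if (v.contains("energy")) {
                return "energy-topic";
            } else if (v.contains("electric")) {
                return "electric-topic";
            }
            return "temperature-topic";
        });

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "description-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}

Note that Kafka Streams does not create dynamically chosen output topics for you, so the target topics generally have to exist before the application starts.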
I would like to confirm my understanding of the efficiency of having multiple processors reading from one Kafka Streams source. I believe Example 1 below is the most efficient approach if I want two different processes performed depending on predicate logic. The predicate looks at the content of the value (the Notification object here). If you put a breakpoint in each of the processors in Example 1, it shows that each function is called for every incoming Notification, whereas in Example 2 the process2 function is only called if the predicate logic is met.
Example 1
@Bean
public Function<KStream<String, Notification>, KStream<String, Notification>[]> process1() {
    return input -> input
        .branch(PREDICATE_FOR_OUT_0, PREDICATE_FOR_OUT_1);
}

@Bean
public Function<KStream<String, Notification>, KStream<String, EnrichedNotification>> process2() {
    return input -> input
        .filter(PREDICATE_FOR_OUT_2)
        .map((key, value) -> .........); // different additional processing to map to EnrichedNotification type
}
So there should be no need for the following, where I attempt to route the output of one processor into another? (I'm not sure it is even possible.)
Example 2 (conceptual)
I am probably thinking this way because I am coming from plain Kafka. Here process1 has a three-way branch. Two of the branches go to their respective streams and then topics, but the third requires further processing before it can be routed to a topic.
@Bean
public Function<KStream<String, Notification>, KStream<String, Notification>[]> process1() {
    return input -> input
        .branch(PREDICATE_FOR_OUT_0, PREDICATE_FOR_OUT_1, PREDICATE_FOR_OUT_2);
}
Could we potentially route the branch for PREDICATE_FOR_OUT_2 into process2? This would mean process2 would only be called if PREDICATE_FOR_OUT_2 was met.
@Bean
public Function<KStream<String, Notification>, KStream<String, EnrichedNotification>> process2() {
    return input -> input
        .map((key, value) -> .........); // different additional processing to map to EnrichedNotification type
}
My thinking is that Example 2 is redundant (and not actually possible anyway) given the abstraction and functionality that Kafka Streams provides.
I think both of your examples can get the job done, but there are some differences. In the first example, you have two functions, both receiving data from the same Kafka topic, and the second function performs some additional logic before routing to the output topic. In the second example, you again have two functions. In the first function you have three branches, each of them sending data to a Kafka topic (I assume they are three different topics). Then in the second function you receive data from the third output topic of the first function. After performing the logic in that second function, you send it to the final destination for this branch. So you are introducing an extra topic in the second example. I think your first example is more readable and clean.
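To see why the first example avoids the extra topic, here is a rough sketch of that topology in plain Kafka Streams DSL, outside Spring Cloud Stream (the topic names, the placeholder predicates, and the toUpperCase() stand-in for the enrichment step are all assumptions for illustration):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Predicate;

public class SingleTopologySketch {

    // Placeholder predicates standing in for PREDICATE_FOR_OUT_0/1/2.
    static final Predicate<String, String> OUT_0 = (k, v) -> v.startsWith("a");
    static final Predicate<String, String> OUT_1 = (k, v) -> v.startsWith("b");
    static final Predicate<String, String> OUT_2 = (k, v) -> v.startsWith("c");

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("notifications"); // topic name assumed

        // First two branches go straight to their output topics.
        KStream<String, String>[] branches = input.branch(OUT_0, OUT_1);
        branches[0].to("out-0");
        branches[1].to("out-1");

        // Third path: filter + map run inside the same topology, so the extra
        // mapping work only happens for records matching OUT_2.
        input.filter(OUT_2)
             .map((key, value) -> KeyValue.pair(key, value.toUpperCase())) // stand-in for the enrichment
             .to("out-2");

        // Only builds and prints the topology; no broker needed to inspect it.
        System.out.println(builder.build().describe());
    }
}

Because filter() and map() run inside the same topology, nothing is written to, or re-read from, an intermediate topic for the third path.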
I have a FlinkKafkaConsumer defined as follows: FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties), and I'm working with event time by using setStreamTimeCharacteristic(TimeCharacteristic.EventTime).
Now I want to assign a periodic watermark with the function assignTimestampsAndWatermarks, but I don't know what I should pass to that function, since the documentation example for this function receives an element of type MyType with a getCreationTime() method, while my consumer produces String elements.
Is it possible to assign event time in this situation?
EDIT: The time I want to use as event time is the time each record was stored in Kafka.
The notion of event time is, at least by definition, strictly connected with the time at which events are created rather than received. So, if the events you are consuming from Kafka carry some kind of timestamp (for example, if you are consuming JSON as String and then parsing it), then you can use that timestamp inside the assignTimestampsAndWatermarks function.
If you are only working with plain String objects, then the best thing you can do is use a custom KafkaDeserializationSchema to extract the Kafka timestamp for each event and use that.
Technically, you could even use a counter that artificially increases the timestamp for each record (for example by incrementing it by 1), but this doesn't make much sense in terms of event time processing.
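A minimal sketch of the KafkaDeserializationSchema approach (the class name and the Tuple2<Long, String> output type are assumptions for illustration):

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Keeps the Kafka record timestamp next to the String payload so it can be used as event time later.
public class StringWithKafkaTimestampSchema
        implements KafkaDeserializationSchema<Tuple2<Long, String>> {

    @Override
    public boolean isEndOfStream(Tuple2<Long, String> nextElement) {
        return false; // the stream never ends
    }

    @Override
    public Tuple2<Long, String> deserialize(ConsumerRecord<byte[], byte[]> record) {
        // record.timestamp() is the timestamp stored with the record in Kafka
        return Tuple2.of(record.timestamp(), new String(record.value(), StandardCharsets.UTF_8));
    }

    @Override
    public TypeInformation<Tuple2<Long, String>> getProducedType() {
        return Types.TUPLE(Types.LONG, Types.STRING);
    }
}

The consumer would then be built as new FlinkKafkaConsumer<>("topic", new StringWithKafkaTimestampSchema(), properties), and the timestamp carried in field f0 can be returned from the extractor passed to assignTimestampsAndWatermarks.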
In a Spring Boot app using Spring Cloud Stream connecting to Kafka, I'm trying to set up two separate stream listener methods:
One reads from topics "t1" and "t2" as KTables, re-partitioning using a different key in one, then joining to data from the other
The other reads from an unrelated topic, "t3", as a KStream.
Because the first listener does some joining and aggregating, some topics are created automatically, e.g. "test-1-KTABLE-AGGREGATE-STATE-STORE-0000000007-repartition-0". (Not sure if this is related to the problem or not.)
When I set up the code with two separate methods annotated with @StreamListener, I get the error below when the Spring Boot app starts:
Exception in thread "test-d44cb424-7575-4f5f-b148-afad034c93f4-StreamThread-2" java.lang.IllegalArgumentException: Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignFromSubscribed(SubscriptionState.java:195)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:225)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:367)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:295)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1146)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1111)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:848)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:805)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:771)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:741)
I think the important part is: "Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3". These are the two unrelated topics, so as far as I can see nothing related to t3 should be subscribing to anything related to t1. The exact topic which causes the problem also changes intermittently: sometimes it's one of the automatically generated topics which is mentioned, rather than t1 itself.
Here is how the two stream listeners are set up (in Kotlin):
@StreamListener
fun listenerForT1AndT2(
    @Input("t1") t1KTable: KTable<String, T1Obj>,
    @Input("t2") t2KTable: KTable<String, T2Obj>) {

    t2KTable
        .groupBy(...)
        .aggregate(
            { ... },
            { ... },
            { ... },
            Materialized.with(Serdes.String(), JsonSerde(SomeObj::class.java)))
        .join(t1KTable,
            { ... },
            Materialized.`as`<String, SomeObj, KeyValueStore<Bytes, ByteArray>>("test")
                .withKeySerde(Serdes.String())
                .withValueSerde(JsonSerde(SomeObj::class.java)))
}

@StreamListener
fun listenerForT3(@Input("t3") t3KStream: KStream<String, T3Obj>) {
    t3KStream.map { ... }
}
However, when I set up my code with just one method annotated with @StreamListener that takes parameters for all three topics, everything works fine, e.g.
@StreamListener
fun compositeListener(
    @Input("t1") t1KTable: KTable<String, T1Obj>,
    @Input("t2") t2KTable: KTable<String, T2Obj>,
    @Input("t3") t3KStream: KStream<String, T3Obj>) {
    ...
}
But I don't think it's right that I can only have one @StreamListener method.
I know that there is content-based routing for adding conditions to the StreamListener annotation, but if the methods define the input channels then I'm not sure I need to use it here; I'd have thought the @Input annotations on the method parameters would be enough to tell the system which channels (and therefore which Kafka topics) to bind to. If I do need content-based routing, how can I apply it here so that each method receives only the items from the relevant topic(s)?
I've also tried separating the two listener methods into two separate classes, each of which has @EnableBinding for only the interface it's interested in (i.e. one interface for t1 and t2, and a separate interface for t3), but that doesn't help.
Everything else I've found related to this error message, e.g. here, is about having multiple app instances, but in my case there's only one Spring Boot app instance.
You need a separate application ID for each StreamListener method. Here is an example:
spring.cloud.stream.kafka.streams.bindings.t1.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t2.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t3.consumer.application-id=processor2-application-id
You probably want to test with the latest snapshot (2.1.0), as there were some recent changes in the way the application ID is processed by the binder.
Please see the update here for more details.
Here is a working sample of multiple StreamListener methods which are Kafka Streams processors.
I want to create an event-time clock for my events in Apache Flink. I am doing it in the following way:
public class TimeStampAssigner implements AssignerWithPeriodicWatermarks<Tuple2<String, String>> {

    private final long maxOutOfOrderness = 0;
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
        currentMaxTimestamp = new Date().getTime();
        return currentMaxTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
Please check the above code and tell me if I am doing it correctly. After the event time and watermark assignment, I want to process the stream in a process function in which I will collect the stream data for 10 minutes for different keys.
No, this is not an appropriate implementation. An event time timestamp should be deterministic (i.e., reproducible), and it should be based on data in the event stream. If instead you use new Date().getTime(), then you are more or less using processing time.
Typically when doing event time processing your events will have a timestamp field, and the timestamp extractor will return the value of this field.
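For illustration, a minimal sketch of such an extractor for the Tuple2<String, String> elements above, assuming the first tuple field carries the event timestamp as epoch milliseconds (the class name and the 10-second bound are made up for the example):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Reads the event timestamp from the data itself, so results are reproducible.
public class EventFieldTimestampAssigner
        extends BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, String>> {

    public EventFieldTimestampAssigner() {
        super(Time.seconds(10)); // tolerate up to 10 seconds of out-of-orderness
    }

    @Override
    public long extractTimestamp(Tuple2<String, String> element) {
        return Long.parseLong(element.f0); // timestamp field carried inside the event
    }
}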
The implementation you've shown will lose most of the benefits that come from working with event time, such as the ability to reprocess historic data in order to reproduce historic results.
Your implementation implements ingestion time into the Flink system, not event time. If you consume from Kafka, for example, previousElementTimestamp should normally point to the time when the event was produced to Kafka (unless the Kafka producer set something else), which would make your stream processing reproducible.
If you want to implement event time processing in Flink, you should rather use a timestamp associated with your element, which could either be inside the element itself (which makes sense for time series) or stored in Kafka and available as previousElementTimestamp.
Regarding maxOutOfOrderness, you probably also want to consider Flink's side-output feature, which makes it possible to get the late elements after the window has been created and update your Flink job's output.
If you consume from Kafka and just want a simple event time processing implementation (accepting some data loss), go with AscendingTimestampExtractor.
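For reference, a small self-contained sketch of AscendingTimestampExtractor usage (the fromElements source is just a stand-in for the Kafka consumer, and the epoch-millis-in-the-first-field layout is assumed):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;

public class AscendingTimestampsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Stand-in source; in the real job this would be the FlinkKafkaConsumer.
        DataStream<Tuple2<String, String>> stream =
                env.fromElements(Tuple2.of("1546300800000", "payload"));

        stream.assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple2<String, String>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple2<String, String> element) {
                        return Long.parseLong(element.f0); // timestamp field inside the event
                    }
                })
              .print();

        env.execute("ascending-timestamps-sketch");
    }
}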
There are some potential problems with an AscendingTimestampExtractor, which can appear if your data are not ordered within the partition, or if you apply the extractor after an operator rather than directly after the KafkaSource.
For a robust industrial use case, you should rather implement watermark ingestion into the persistent log storage, as mentioned in the Google Dataflow model.