How to trigger a window if one of multiple Kafka topics is idle - apache-kafka

I'm consuming multiple Kafka topics, windowing them hourly and writing them into separate parquet files for each topic. However, if one of the topics is idle, the window does not get triggered and nothing is written to the filesystem. For this example, I'm consuming 2 topics, each with a single partition, with taskmanager.numberOfTaskSlots: 2 and parallelism.default: 1. What is the proper way of solving this problem in Apache Beam with the Flink Runner?
pipeline
  .apply(
    "ReadKafka",
    KafkaIO
      .read[String, String]
      .withBootstrapServers(bootstrapServers)
      .withTopics(topics)
      .withCreateTime(Duration.standardSeconds(0))
      .withReadCommitted
      .withKeyDeserializer(classOf[StringDeserializer])
      .withValueDeserializer(classOf[StringDeserializer])
      .withoutMetadata()
  )
  .apply("ConvertToMyEvent", MapElements.via(new KVToMyEvent()))
  .apply(
    "WindowHourly",
    Window.into[MyEvent](FixedWindows.of(Duration.standardHours(1)))
  )
  .apply(
    "WriteParquet",
    FileIO
      .writeDynamic[String, MyEvent]()
      .by(new BucketByEventName())
      //...
  )

A time window needs data. If the topic is idle, there is no data to close the window, and the window stays open until data arrives. If you want to window data based on processing time instead of actual event time, try using a simple process function:
public class MyProcessFunction
        extends KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType> {
    // The key/input/output data types can be primitives like String or your custom classes.

    // Flink managed state must be declared like this and initialized in open();
    // a normal field assignment does not work.
    private transient ValueState<Long> windowTime;

    @Override
    public void open(final Configuration conf) {
        final ValueStateDescriptor<Long> windowDesc = new ValueStateDescriptor<>("windowDesc", Long.class);
        this.windowTime = this.getRuntimeContext().getState(windowDesc);
    }

    @Override
    public void processElement(InputDataType input, Context context, Collector<OutputDataType> collector)
            throws IOException {
        if (this.windowTime.value() != null) {
            // Delete the previously registered timer if you want to reset it.
            context.timerService().deleteProcessingTimeTimer(this.windowTime.value());
        }
        // Store the absolute fire time: current processing time plus the window interval
        // (milliseconds are recommended).
        this.windowTime.update(context.timerService().currentProcessingTime() + <window interval>);
        // Register a timer; it fires <window interval> after the current time.
        context.timerService().registerProcessingTimeTimer(this.windowTime.value());
        // ...
    }

    @Override
    public void onTimer(long timestamp,
                        KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType>.OnTimerContext context,
                        Collector<OutputDataType> collector) throws IOException {
        // This method is executed when the timer fires.
        collector.collect( <whatever you want to stream out> ); // this data continues down the pipeline
    }
}
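To wire this in, here is a minimal sketch (the event type, key selector, and stream names below are placeholders, not from the original job):
// Key the stream and attach the process function; its processing-time timer
// fires even while one of the source topics is idle, so output is still produced.
DataStream<OutputDataType> output = events
        .keyBy(event -> event.getKey())      // hypothetical key selector
        .process(new MyProcessFunction());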

Related

Apache Flink Stream join using DataStream API not outputting anything

I have 2 streams created from Kafka topics and I'm joining them using the DataStream API. I want the results of the join (apply) to be published to another Kafka topic. I don't see the results of the join in the output topic.
I confirm I'm publishing proper data to both source topics. Not sure where it is going wrong. Here is a code snippet.
The streams are created as shown below.
DataStream<String> ms1 = env.addSource(new FlinkKafkaConsumer<>("top1", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(new WatermarkStrategy<String>() {
            @Override
            public WatermarkGenerator<String> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new AscendingTimestampsWatermarks<>();
            }

            @Override
            public TimestampAssigner<String> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return (event, timestamp) -> System.currentTimeMillis();
            }
        });

DataStream<String> ms2 = env.addSource(new FlinkKafkaConsumer<>("top2", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(new WatermarkStrategy<String>() {
            @Override
            public WatermarkGenerator<String> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new AscendingTimestampsWatermarks<>();
            }

            @Override
            public TimestampAssigner<String> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return (event, timestamp) -> System.currentTimeMillis();
            }
        });
The stream join is performed using join-where-equals, as below.
DataStream<CountryData> joinedStreams = ms1.join(ms2)
        .where(o -> { String[] tokens = o.split("::"); return tokens[0]; })
        .equalTo(o -> { String[] tokens = o.split("::"); return tokens[0]; })
        .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
        .apply(new JoinFunction<String, String, CountryData>() {
            @Override
            public CountryData join(String o, String o2) throws Exception {
                String[] tokens1 = o.split("::");
                String[] tokens2 = o2.split("::");
                CountryData countryData = new CountryData(tokens1[0], tokens1[1], tokens1[2],
                        Long.parseLong(tokens1[3]) + Long.parseLong(tokens2[3]));
                return countryData;
            }
        });
The sink is added as below:
DataStreamSink<CountryData> dataStreamSink = joinedStreams.addSink(
        new FlinkKafkaProducer<CountryData>("localhost:9095", "flink-output", new CustomSchema()));
dataStreamSink.setParallelism(1);
dataStreamSink.name("KAFKA-TOPIC");
Any clue where it is going wrong? I can see messages available in the topology.
Thanks
I think the two FlinkKafkaConsumer instances are missing a time extractor and a watermark configuration.
Since the code is using an event-time window join, it needs some kind of time information associated with the data found in Kafka in order to know which time window each event corresponds to.
Without that, events from both streams are probably never close enough in event time to match the 60s window defined by EventTimeSessionWindows.withGap(Time.seconds(60)).
You also need to set the watermark parameters to tell Flink when to stop waiting for new data and materialize the output, so that you can see the join result.
Have a look at the Kafka connector time and watermark configuration for the various time extraction and watermarking possibilities you have.
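For illustration, here is a minimal sketch (not the original code; it assumes the "::"-delimited payload carries an epoch-millis event timestamp in one of its tokens, arbitrarily tokens[3] here) of assigning event time from the record itself with bounded-out-of-orderness watermarks:
// Extract the event timestamp from the record payload instead of using
// System.currentTimeMillis(), and tolerate up to 30 seconds of out-of-orderness.
// Duration here is java.time.Duration.
WatermarkStrategy<String> strategy = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30))
        .withTimestampAssigner((record, previousTimestamp) -> {
            String[] tokens = record.split("::");
            return Long.parseLong(tokens[3]); // hypothetical position of the timestamp field
        });

DataStream<String> ms1 = env
        .addSource(new FlinkKafkaConsumer<>("top1", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(strategy);
With timestamps taken from the events themselves, the watermark can advance past the session gap and the join windows can actually fire.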
Finally, make sure you send test data spread over a long enough time period to your application. With event-time processing, only "old enough" data makes it to the output; younger data is always "stuck in transit". For example, with a 60s time window and, say, a 30s watermark, you would need at least 90s of data before you see anything in the output.

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink.
I want to enrich the incoming click stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have done some research and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream using some IP-matching logic.
1. Result: It worked well for a couple of sample IP location entries, but not with the whole CSV data. The JVM heap reached 3.5 GB, and because it is broadcast state, there is no way to spill it to disk (RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: Because the enrichment data is too big to store in the JVM heap, it's impossible to load it into ValueState. Also, de/serializing through ValueState is bad practice for data of a key-value nature.
3. Solution: To avoid dealing with the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, with MapState.
3. Result: Trying to load the CSV file into MapState in the open() method gave me an error telling me that you cannot put into MapState in the open() method, because I was not in a keyed context there, as in this question: Flink keyed stream key is null
4. Solution: Because MapState needs a keyed context (to be put into RocksDB), I tried to load the whole CSV file into a local RocksDB instance (disk) in a process function after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {

  var ipMapState: MapState[String, String] = _
  var csvFinishedFlag: ValueState[Boolean] = _

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[Long, Event, Event]#Context,
                              out: Collector[Event]): Unit = {
    val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
    val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])

    ipMapState = getRuntimeContext.getMapState(ipDescriptor)
    csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)

    if (!csvFinishedFlag.value()) {
      // Blocking file read: loads the whole CSV into MapState on the first element.
      val csv = new CSVParser(defaultCSVFormat)
      val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
      for (row <- fileSource.getLines()) {
        val Some(List(start, end, country)) = csv.parseLine(row)
        ipMapState.put(start, country)
      }
      fileSource.close()
      csvFinishedFlag.update(true)
    }

    out.collect {
      if (ipMapState.contains(event.userIp)) {
        val country = ipMapState.get(event.userIp)
        event.copy(data =
          event.data.copy(
            ipLocation = Some(country)
          ))
      } else {
        event
      }
    }
  }
}
4. Result: It's too hacky and it blocks event processing due to the blocking file read operation.
Could you tell me what I can do in this situation?
Thanks
What you can do is implement a custom partitioner and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
@Override
public int partition(final Long sensorMeasurement, final int numPartitions) {
return Math.toIntExact(sensorMeasurement % numPartitions);
}
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
private Map<Long, SensorReferenceData> referenceData;
@Override
public void open(final Configuration parameters) throws Exception {
super.open(parameters);
referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
}
@Override
public void flatMap(
final SensorMeasurement sensorMeasurement,
final Collector<EnrichedMeasurements> collector) throws Exception {
SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
}
private Map<Long, SensorReferenceData> loadReferenceData(
final int partition,
final int numPartitions) {
SensorReferenceDataClient client = new SensorReferenceDataClient();
return client.getSensorReferenceDataForPartition(partition, numPartitions);
}
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.

Send a message to Kafka when a SessionWindow starts and when it ends

I want to send a message to a Kafka topic when a new SessionWindow is created and when it ends. I have the following code:
stream
.filter(user -> user.isAdmin)
.keyBy(user -> user.username)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
//what now? Trigger?
Now I want to send a message when a new session starts (with some metadata like web browser and timestamps; this information is available in each element of the stream) and send a message to Kafka when the session ends (in this example, 10 seconds after the last element, I think) with the total number of requests.
Is this possible in Flink? I think I should use some trigger, but I don't know how and I can't find any examples.
If you want to do this when the window is processed, then you can simply use a ProcessWindowFunction; basically, what you need to do is add .process(new MyProcessFunction()) to your code. In the process function you have access to the whole window, including its first (start) and last (end) element. You can use a side output to emit just the beginning and the end of the given window, then create a stream from the side output and sink it to Kafka. More on side outputs can be found here.
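A minimal sketch of that idea (using the User type implied by the question's stream; the tag name and output types are placeholders), emitting session start/end metadata to a side output when the window is processed:
// Side output tag for session metadata (the anonymous subclass is required so
// Flink can capture the type information).
final OutputTag<String> sessionEvents = new OutputTag<String>("session-events") {};

SingleOutputStreamOperator<Long> totals = stream
        .filter(user -> user.isAdmin)
        .keyBy(user -> user.username)
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
        .process(new ProcessWindowFunction<User, Long, String, TimeWindow>() {
            @Override
            public void process(String username, Context ctx, Iterable<User> users, Collector<Long> out) {
                long requestCount = 0;
                for (User ignored : users) {
                    requestCount++;
                }
                // The window bounds give the session's start and end times.
                ctx.output(sessionEvents, "session for " + username
                        + " started at " + ctx.window().getStart()
                        + ", ended at " + ctx.window().getEnd());
                out.collect(requestCount); // total requests in the session
            }
        });

// The side-output stream can then be sunk to Kafka, e.g. with a FlinkKafkaProducer.
DataStream<String> sessionInfo = totals.getSideOutput(sessionEvents);
Note that with this approach both the "started" and "ended" messages are only emitted once the session window fires, i.e. after the 10-second gap has elapsed.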
You can write a custom window trigger.
How do you tell that a new session has started?
You can create a ValueState with a default value of null; if the state value is null, it is a session start.
When does the session end?
Just before TriggerResult.FIRE.
Here is a demo based on Flink's ProcessingTimeTrigger. I only put the question-related logic here; you can check the other details in its source code.
public class MyProcessingTimeTrigger extends Trigger<Object, TimeWindow> {
// a state which keeps a session start.
private final ValueStateDescriptor<Long> stateDescriptor = new ValueStateDescriptor<Long>("session-start", Long.class);
@Override
public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
ValueState<Long> state = ctx.getPartitionedState(stateDescriptor);
if(state.value() == null) {
// if value is null, it's a session start.
state.update(window.getStart());
}
ctx.registerProcessingTimeTimer(window.maxTimestamp());
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
// here is a session end.
return TriggerResult.FIRE;
}
@Override
public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(stateDescriptor).clear();
ctx.deleteProcessingTimeTimer(window.maxTimestamp());
}
}
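For completeness, a sketch of how such a trigger could be attached to the original pipeline (the downstream window function is a placeholder; also note that because session windows merge, the trigger additionally needs canMerge()/onMerge() overridden, as ProcessingTimeTrigger does, which the demo above omits):
stream
        .filter(user -> user.isAdmin)
        .keyBy(user -> user.username)
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
        .trigger(new MyProcessingTimeTrigger())   // replaces the default trigger
        .process(new MySessionWindowFunction());  // hypothetical ProcessWindowFunction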

Apache Beam: Error assigning event time using Withtimestamp

I have an unbounded Kafka stream sending data with the following fields
{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}
I read the stream using the Apache Beam SDK for Kafka:
import org.apache.beam.sdk.io.kafka.KafkaIO;
pipeline.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("kafka:9092")
.withTopic("test")
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
.updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
.commitOffsetsInFinalize()
.withoutMetadata());
Since I want to window using event time ("ts" in my example), I parse the incoming string and assign the "ts" field of the incoming data stream as the timestamp.
PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
.apply(ParDo.of(new ReadFromTopic()))
.apply("ParseTemperature", ParDo.of(new ParseTemperature()));
tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));
The window function and the computation is applied as below:
PCollection<Output> output = tempCollection.apply(Window
.<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
.triggering(AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.standardDays(1))
.accumulatingFiredPanes())
.apply(new ComputeMax());
I stream data into the input stream with a lag of 5 seconds from the current UTC time, since in practical scenarios the event timestamp is usually earlier than the processing timestamp.
I get the following error:
Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output
timestamps must be no earlier than the timestamp of the current input
(2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds).
See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing
the allowed skew.
If I comment out the line for AssignTimeStamps, there are no errors, but I guess it then uses processing time.
How do I ensure my computation and windows are based on event time and not processing time?
Please provide some inputs on how to handle this scenario.
To be able to use a custom timestamp, you first need to implement a custom timestamp policy by extending TimestampPolicy<KeyT, ValueT>.
For example:
public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {
protected Instant currentWatermark;
public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
}
@Override
public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
return currentWatermark;
}
@Override
public Instant getWatermark(PartitionContext ctx) {
return currentWatermark;
}
}
Then you need to pass your custom TimestampPolicy when setting up your KafkaIO source, using the functional interface TimestampPolicyFactory:
KafkaIO.<String, Foo>read().withBootstrapServers("http://localhost:9092")
.withTopic("foo")
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializerAndCoder(KafkaAvroDeserializer.class, AvroCoder.of(Foo.class)) //if you use avro
.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
.updateConsumerProperties(kafkaProperties))
This line is responsible for creating a new timestamp policy, passing the related partition and the previously checkpointed watermark (see the documentation):
.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
Have you had a chance to try this using the timestamp policy? Sorry, I have not tried this one out myself, but I believe with 2.9.0 you should look at using the policy along with the KafkaIO read.
https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withTimestampPolicyFactory-org.apache.beam.sdk.io.kafka.TimestampPolicyFactory-

KTable Reduce function does not honor windowing

Requirement: We need to consolidate all the messages having the same order ID and perform a subsequent operation on the consolidated message.
Explanation: The snippet of code below tries to capture all order messages received from a particular tenant and consolidate them into a single order message after waiting for a specific period of time.
It does the following:
Repartitions messages based on OrderId, so each order message has the tenantId and order key as its key
Performs a group-by-key operation followed by a windowed operation of 2 minutes
A reduce operation is performed once windowing is completed
The KTable is converted back to a stream and its output is sent to another Kafka topic
Expected output: If 5 messages having the same order ID are sent within the window period, it was expected that the final Kafka topic would contain only one message, and that it would be the result of the last reduce operation.
Actual output: All 5 messages are seen, indicating that windowing is not happening before the reduce operation is invoked. Each message seen in Kafka has the proper reduce operation applied as each message is received.
Queries: In Kafka Streams library version 0.11.0.0, the reduce function used to accept a time window as its argument. I see that this is deprecated in Kafka Streams version 1.0.0. Is the windowing done in the piece of code below correct? Is windowing supported in the newer Kafka Streams library version 1.0.0? If so, is there something that can be improved in the snippet below?
String orderMsgTopic = "sampleordertopic";
JsonSerializer<OrderMsg> orderMsgJSONSerialiser = new JsonSerializer<>();
JsonDeserializer<OrderMsg> orderMsgJSONDeSerialiser = new JsonDeserializer<>(OrderMsg.class);
Serde<OrderMsg> orderMsgSerde = Serdes.serdeFrom(orderMsgJSONSerialiser,orderMsgJSONDeSerialiser);
KStream<String, OrderMsg> orderMsgStream = this.builder.stream(orderMsgTopic, Consumed.with(Serdes.ByteArray(), orderMsgSerde))
.map(new KeyValueMapper<byte[], OrderMsg, KeyValue<? extends String, ? extends OrderMsg>>() {
@Override
public KeyValue<? extends String, ? extends OrderMsg> apply(byte[] byteArr, OrderMsg value) {
TenantIdMessageTypeDeserializer deserializer = new TenantIdMessageTypeDeserializer();
TenantIdMessageType tenantIdMessageType = deserializer.deserialize(orderMsgTopic, byteArr);
String newTenantOrderKey = null;
if ((tenantIdMessageType != null) && (tenantIdMessageType.getMessageType() == 1)) {
Long tenantId = tenantIdMessageType.getTenantId();
newTenantOrderKey = tenantId.toString() + value.getOrderKey();
} else {
newTenantOrderKey = value.getOrderKey();
}
return new KeyValue<String, OrderMsg>(newTenantOrderKey, value);
}
});
final KTable<Windowed<String>, OrderMsg> orderGrouping = orderMsgStream.groupByKey(Serialized.with(Serdes.String(), orderMsgSerde))
.windowedBy(TimeWindows.of(windowTime).advanceBy(windowTime))
.reduce(new OrderMsgReducer());
orderGrouping.toStream().map(new KeyValueMapper<Windowed<String>, OrderMsg, KeyValue<String, OrderMsg>>() {
@Override
public KeyValue<String, OrderMsg> apply(Windowed<String> key, OrderMsg value) {
return new KeyValue<String, OrderMsg>(key.key(), value);
}
}).to("newone11", Produced.with(Serdes.String(), orderMsgSerde));
I realised that I had set StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG to 0 and had also set the default commit interval of 1000 ms. Changing these values helped me, to some extent, get the windowing working.
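For reference, a hedged sketch of the kind of configuration this refers to (the application id, broker address, and concrete values below are placeholders, not taken from the original setup):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-consolidation");   // hypothetical id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // hypothetical broker
// A non-zero record cache lets the windowed KTable absorb repeated updates for the
// same key instead of forwarding every single one downstream...
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// ...and a longer commit interval flushes (and therefore emits) less frequently.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30 * 1000);
As noted above, caching only reduces the number of intermediate results per window; it does not guarantee a single final record.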