The application:
- reads from a Kafka topic,
- requires each message to be unique (duplicates are ignored),
- holds the data for 'N' seconds,
- and writes each message individually to a different Kafka topic.
Is there a way to hold a message for 'N' seconds and then write it to Kafka? Each message must be written to the same (output) topic 'N' seconds after the time it came in.
Currently I'm holding the data in a JSON structure in memory, and every time a message comes in I loop through all the messages I have and compare times. Naturally this is not the way to do it.
val some_consumer = new FlinkKafkaConsumer09(data_topic,
  new JSONKeyValueDeserializationSchema(false), properties)
some_consumer.setStartFromLatest()

val in_stream = env.addSource(some_consumer)
  .filter(!_.isNull)
  .map(x => processMessage(x))
def processMessage(x: ObjectNode): Unit = {
  // store the message in the in-memory JSON structure if it is not already there
  // loop through the entire set and compare times
  // if 'N' seconds have passed since a message arrived,
  // write it to Kafka
  kafka_producer.send(new ProducerRecord[String, String](output_topic, the_unique_message))
}
You should hold the messages in Flink state, so that they are checkpointed, and will be restored in the case of failures.
To de-duplicate the stream, you can key the stream by whatever attribute makes an event unique, e.g., keyBy(x -> x.uniqueId). Then I would use a KeyedProcessFunction and buffer the first event for each key in a ValueState&lt;Event&gt;. You can use either an event-time timer or a processing-time timer to trigger sending out the event (whichever is appropriate). If the scope of de-duplication is N seconds, then you can clear the state at the same time you emit the event.
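A minimal sketch of that approach, written against Flink's Java API (the question uses the Scala API, but the same classes exist there). Event, uniqueId, and the constructor argument are placeholders rather than names from the original job:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers the first event per key in Flink state, emits it after a fixed delay,
// and ignores any duplicates (same key) that arrive in the meantime.
public class DedupAndDelay extends KeyedProcessFunction<String, Event, Event> {

    private final long delayMillis;                  // 'N' seconds, in milliseconds
    private transient ValueState<Event> buffered;

    public DedupAndDelay(long delaySeconds) {
        this.delayMillis = delaySeconds * 1000;
    }

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getState(
                new ValueStateDescriptor<>("buffered", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (buffered.value() == null) {              // first occurrence of this key
            buffered.update(event);
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + delayMillis);
        }                                            // duplicates are simply dropped
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        Event event = buffered.value();
        if (event != null) {
            out.collect(event);                      // a Kafka producer sink writes it out downstream
        }
        buffered.clear();                            // de-duplication scope ends after N seconds
    }
}
It would be wired in along the lines of in_stream.keyBy(x -> x.uniqueId).process(new DedupAndDelay(n)).addSink(...), with a FlinkKafkaProducer for the output topic as the sink.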
You can use tumbling windows:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#tumbling-windows
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
The example above means the data goes out every 5 seconds, and you can see it clearly when printing to the console.
In your case you don't need event time and can use processing time instead. You also don't need keyBy() and could use an all-window, although keyBy() is not a bad idea since it gives you parallelism. After window() you can add a Kafka sink, because the window will periodically emit its events every X minutes/seconds, as you wish.
Be careful about the memory limit, because the data kept in the window is stored in memory.
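A rough sketch of that pipeline in Flink's Java API; Message, getKey(), consumer, and kafkaProducer are placeholders for the question's own types and connectors:
// Rough sketch: collect 5 seconds of data per key in a tumbling processing-time
// window, reduce it, and forward the result to a Kafka sink.
DataStream<Message> input = env.addSource(consumer);

input
    .keyBy(m -> m.getKey())                                      // optional, but gives you parallelism
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((first, second) -> first)                            // e.g. keep only the first message per key
    .addSink(kafkaProducer);                                     // a FlinkKafkaProducer for the output topic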
Related
I have a Kafka topic we will call ingest that receives an entry every x seconds. I have a process that I want to run on this data but it has to be run on 100 events at a time. Thus, I want to batch the entries together and send them to a new topic called batched-ingest. The two topics will look like this...
ingest = [entry, entry, entry, ...]
batched-ingest = [[entry_0, entry_1, ..., entry_99]]
What is the correct way to do this using Faust? The solution I have right now is this...
import faust

app = faust.App("explore", value_serializer="raw")
ingest = app.topic('ingest')
ingest_batch = app.topic('ingest-batch')

@app.agent(ingest, sink=[ingest_batch])
async def test(stream):
    async for values in stream.take(10, within=1000):
        yield values
I am not sure if this is the correct way to do this in Faust. If so, what should I set within to in order to make it always wait until len(values) = 100?
As mentioned in the Faust take documentation, if you omit within from take(100, within=10), the code will block forever if there are 99 messages and the hundredth message is never received. To solve this, add a within timeout so that up to 100 values are processed within 10 seconds; that way, if there is a period of 10 seconds with no events received, the agent will still process what it has gathered.
In my Kafka Streams application, I have a task that sets up a punctuator scheduled by wall-clock time. The punctuator iterates over the entries of a store and does something with them. Like this:
var store = context().getStateStore("MyStore");
var iter = store.all();
while (iter.hasNext()) {
    var entry = iter.next();
    // ... do something with the entry
}
// Print a summary (now):  "N entries processed"
// Print a summary (wish): "N entries processed in partition P"
Since I'm working with a single store here (which might be partitioned), I assume that every single execution of the punctuator is bound to a single partition of that store.
Is it possible to find out which partition the punctuator operates on? The Javadoc for ProcessorContext.partition() states that this method returns -1 within punctuators.
I've read Kafka Streams: Punctuate vs Process and the answers there. I can understand that a task is, in general, not tied to a particular partition. But an iterator should be tied IMO.
How can I find out the partition?
Or is my assumption that a particular instance of a store iterator is tied to a partion wrong?
What I need it for: I'd like to include the partition number in some log messages. For now, I have several nearly identical log messages stating that the punctuator does this and that. In order to make those messages "unique", I'd like to include the partition number in them.
Just to post here the answer that was provided in https://issues.apache.org/jira/browse/KAFKA-12328:
I just used context.taskId(). It contains the partition number at the end of the value, after the underscore. This was sufficient for me.
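For illustration, the string form of the TaskId looks like "0_2" (topic group, underscore, partition), so the partition can be pulled out like this; log and count here are placeholders for whatever the punctuator already tallies:
// Parse the partition from the TaskId's string form, e.g. "0_2" -> partition 2.
String taskId = context().taskId().toString();
int partition = Integer.parseInt(taskId.substring(taskId.lastIndexOf('_') + 1));
log.info("{} entries processed in partition {}", count, partition);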
I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
    .join(env.addSource(createKafkaConsumer()))
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    .apply(new RichJoinFunction<A, B>() { ... });
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
KafkaTopic A has 1 record, KafkaTopic B has 5. My understanding would be that the JoinFunction is triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A came in within the 2 hours, another 5 records would be created (2x5 records). However, what comes through in the sink is rather unpredictable; I could not see a pattern. Sometimes there's nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join together with the prior records.
My key question:
What is actually happening here? Are the records only emitted after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would that make my TimeWindow obsolete? Do those two concepts exist in parallel, i.e., the join window is 2 hours, but the join function + output is triggered per element? How about duplicates in that case?
Subsequently, does processing time mean the point in time when the record is consumed from the topic? So if I, for example, setStartFromEarliest() on start, would all messages consumed within the next two hours be in that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); is set and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events which fall into the given timespan, in your case 2 hours. By default, the window will only output results once the 2 hours are over, because it needs to know that no other events will be coming for this window.
If you want to output early results (e.g. for every incoming record), then you could specify a custom Trigger which fires on every element. See the Trigger API docs for more information about this.
Update
The window does not start with the first element; windows start at multiples of the window length. For example, if your window size is 2 hours, then you can only have the windows [0, 2), [2, 4), ..., but not [1, 3) or [3, 5).
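For reference, a hedged sketch of what such an early-firing trigger could look like against Flink's Trigger API (my own illustration, not code from this thread):
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires the window on every element (early results) and still fires-and-purges
// when the processing-time window ends.
public class FireOnEveryElement extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        ctx.registerProcessingTimeTimer(window.maxTimestamp());  // end-of-window timer
        return TriggerResult.FIRE;                               // emit early for each incoming element
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE_AND_PURGE;                     // final result when the window ends
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}
It would be plugged in via .window(TumblingProcessingTimeWindows.of(Time.hours(2))).trigger(new FireOnEveryElement()) before apply(...). Note that each early firing emits the join results computed so far, so downstream consumers need to tolerate the resulting duplicates.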
I want to use the Kafka Processor API to process messages from Kafka.
I would like to call some function periodically, something like context.schedule(intervalMs, punctuationType, somePunctuator), where somePunctuator performs some periodic job. But instead of using a time interval as the trigger, I would like to invoke that task after processing some number of messages.
Is it possible to do such triggering in Kafka Streams?
Yes, it's possible using a Kafka Streams state store.
The logic depends on what exactly you need to do when the number of processed messages is reached.
If you need to propagate data to the next processor or sink node, store the aggregated values as a list of objects inside a key-value state store. Inside Processor.process(..) you put the data into the key-value store, then check whether the number of items has reached the limit, and perform the required logic (like processorContext.forward(..)). Please take a look at a similar example here.
If you need to do some logic after reaching the number and don't need the values, you could store only a counter and increment it inside Processor.process(..), as in the sketch below.
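A hedged sketch of the counter variant, written against the newer typed Processor API (org.apache.kafka.streams.processor.api); the threshold, the "counter-store" name, and the String key/value types are assumptions, and the store has to be registered with the topology and connected to this processor:
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Counts processed messages in a state store and runs the "punctuation" logic
// once the count reaches a threshold, instead of on a wall-clock schedule.
public class CountBasedProcessor implements Processor<String, String, String, String> {

    private static final String COUNTER_KEY = "count";

    private final long threshold;
    private ProcessorContext<String, String> context;
    private KeyValueStore<String, Long> counterStore;

    public CountBasedProcessor(long threshold) {
        this.threshold = threshold;
    }

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.counterStore = context.getStateStore("counter-store");   // must be attached to this processor
    }

    @Override
    public void process(Record<String, String> record) {
        Long current = counterStore.get(COUNTER_KEY);
        long count = (current == null) ? 1L : current + 1;

        if (count >= threshold) {
            // the "periodic" work goes here, e.g. forwarding something downstream
            context.forward(record);
            counterStore.put(COUNTER_KEY, 0L);                        // reset for the next batch
        } else {
            counterStore.put(COUNTER_KEY, count);
        }
    }
}
With the older Processor&lt;K, V&gt; interface the logic is the same, only process(key, value) takes the place of process(record).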
I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The wanted end result is data packaged in 10 minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23:50 < t <= 00:00). Each package is to contain only data of one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it to a table in a Cassandra database where the key is the timestamp and the channel name, and every time such an RDD hits a 10 minute boundary, this boundary is posted to another topic, alongside the channel whose boundary is hit.
The second job listens to this "boundary topic", and for every received 10 minute boundary, the data is pulled from Cassandra, some calculations like min, max, mean, stddev are done and the data and these results are packaged to a defined output directory. That way, each directory contains the data from one channel and one 10 minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
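A hedged sketch of what that could look like in the Streams DSL (Java); the topic names, serdes, and the Summary type with its add(...) method are placeholders, and depending on your Kafka version TimeWindows.ofSizeWithNoGrace(...) may replace TimeWindows.of(...):
StreamsBuilder builder = new StreamsBuilder();

KStream<String, Double> values = builder.stream("channel-values",
        Consumed.with(Serdes.String(), Serdes.Double()));

values
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))              // aligned windows: 00:00-00:10, 00:10-00:20, ...
    .aggregate(
        Summary::new,                                                // initializer
        (channel, value, summary) -> summary.add(value),             // update min/max/mean/stddev state
        Materialized.with(Serdes.String(), summarySerde))
    .toStream()
    .map((windowedKey, summary) -> KeyValue.pair(
        windowedKey.key() + "@" + windowedKey.window().start(),      // channel + window start
        summary))
    .to("channel-windows", Produced.with(Serdes.String(), summarySerde));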
I work with Flink stream processing, not Spark Streaming, but I guess the programming concepts of both are alike. So, supposing the data is ordered chronologically and you want to aggregate it every 10 minutes and do some processing on the aggregated data, I think the best approach is to use streaming window functions. I suggest defining a function that maps every incoming record's timestamp to the last 10-minute boundary:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
where the Long field is the mapped timestamp of every message. Then you can apply a window; you should look into which kind of window is most appropriate for your case.
Point: setting a key for the data stream causes the window function to maintain a logical window for every key.
In the simplest case, you would define a time window of 10 minutes and aggregate all data arriving in that period of time.
The other approach, if you know the rate at which the data is generated and how many messages will be generated in a period of 10 minutes, is to use a count window. For example, a window with a count of 20 will listen to the stream, aggregate all the messages with the same key in a logical window, and apply the window function only when the number of messages in the window reaches 20.
After the messages are aggregated in a window as desired, you can apply your processing logic using a reduce function or a similar operation.
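For illustration, a hedged Flink sketch of the time-window variant using a recent Java API; ChannelValue, getChannel(), getTimestamp(), summarize(...), kafkaSource, and outputSink are placeholders standing in for the question's Avro type and the min/max/mean/stddev logic:
// e.g. 12:10:24 -> window 12:10:00, 12:25:24 -> window 12:20:00; the window
// assigner below does this 10-minute alignment for you.
DataStream<ChannelValue> values = env
    .addSource(kafkaSource)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<ChannelValue>forMonotonousTimestamps()    // data arrives in order
            .withTimestampAssigner((v, ts) -> v.getTimestamp()));

values
    .keyBy(v -> v.getChannel())                                      // one logical window per channel
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))           // aligned 10-minute windows
    .process(new ProcessWindowFunction<ChannelValue, String, String, TimeWindow>() {
        @Override
        public void process(String channel, Context ctx,
                            Iterable<ChannelValue> elements, Collector<String> out) {
            // compute min / max / mean / stddev over the window contents here
            out.collect(summarize(channel, ctx.window().getStart(), elements));
        }
    })
    .addSink(outputSink);                                            // e.g. files or another Kafka topic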