I have some historical data, and each record has its own timestamp. I would like to read the records and feed them into a Kafka topic, then use Kafka Streams to process them in a time-windowed manner.
Now the question is: when I create a Kafka Streams time-windowed aggregation processor, how can I tell Kafka to build the time windows from the timestamp field in the record instead of the actual wall-clock time?
You need to create a custom TimestampExtractor that extracts the timestamp from the record itself - there's an example of this in the documentation, and elsewhere too, including a gist that looks relevant.
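A minimal sketch of such an extractor, assuming the record value deserializes to a POJO (here a made-up MyEvent class) carrying the historical timestamp in epoch milliseconds:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Hypothetical value type; in practice this is whatever your deserializer produces.
class MyEvent {
    private long eventTime;          // epoch millis taken from the historical record
    public long getEventTime() { return eventTime; }
}

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof MyEvent) {
            // Use the timestamp embedded in the record, not wall-clock/ingestion time.
            return ((MyEvent) value).getEventTime();
        }
        // Fall back to the last valid timestamp seen on this partition.
        return partitionTime;
    }
}
```

Register it with `props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class);` and the windowed aggregations downstream will then be driven by the event time in your data.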
I am trying to tune the windowing parameters in my streaming Beam pipeline. The parameters I am modifying are withAllowedLateness, triggers, interval, pane firing, etc.
However, I don't know how to trigger lateness in my Kafka-consuming pipeline so I can test the changes. Can anybody suggest how to create late events?
Thanks
Do you use the Kafka publish time as the window time, or a custom field?
Most of the time we window on a custom date field (which in most cases makes more sense, since you want to group on some logical event time, and a publisher with issues may also publish messages with some delay). Then it is very easy to simulate "late data": just send events whose custom date field contains a past date/time, as in the sketch below.
Do you rely on message order when consuming the data? If so, you can keep publishing to your Kafka topic without reading from it at all, then start the Beam job once you have a huge backlog. When there is a backlog, messages are usually not read in order, which causes more data to arrive after the window has closed - in other words, late data.
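For the first approach, a minimal sketch of a producer that writes an event whose custom date field lies in the past (topic name, key and field name are made up here):

```java
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LateEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Event time two hours in the past, i.e. beyond a typical allowed-lateness setting.
            long pastEventTime = Instant.now().minusSeconds(2 * 60 * 60).toEpochMilli();
            String payload = "{\"deviceId\":\"device-1\",\"eventTime\":" + pastEventTime + "}";
            producer.send(new ProducerRecord<>("input-events", "device-1", payload));
        }
    }
}
```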
I have N Kafka topics, each with data and a timestamp, and I need to combine them into a single topic sorted by timestamp, with the data sorted inside each partition. I have one way to do that:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with clustering order set to DESCENDING. This combines everything, but the limitation is that if a record arrives after the timed accumulation window has passed, it won't be sorted.
Is there any other, more appropriate way to do this? If not, is there any way to improve my solution?
Thanks
It's not clear why you need Kafka itself to sort on timestamps. Typically this is done only at consumption time, for each batch of messages.
For example, create a Kafka Streams process that reads from all topics, build a GlobalKTable, and enable interactive queries.
When you query, you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
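A rough sketch of the client-side sorting idea, using a plain consumer subscribed to several topics and ordering each polled batch by record timestamp (topic and group names are placeholders):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SortedReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sorted-reader");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-a", "topic-b", "topic-c"));
            while (true) {
                List<ConsumerRecord<String, String>> batch = new ArrayList<>();
                consumer.poll(Duration.ofSeconds(1)).forEach(batch::add);
                // Sort the polled batch by timestamp before handing it to downstream logic.
                batch.sort((a, b) -> Long.compare(a.timestamp(), b.timestamp()));
                batch.forEach(r -> System.out.println(r.timestamp() + " " + r.value()));
            }
        }
    }
}
```

Note that this only orders records within each polled batch; a global order across topics would need additional buffering on the client.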
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it) - for example, Couchbase or CockroachDB.
Then, when you query those later, sort the results with an ORDER BY clause.
I have a topic where I get bursts of events from various devices. There are n devices, each of which emits a weather report every s seconds.
The problem is that these devices emit 5-10 records of the same value every s seconds. So if you look at the output in the Kafka topic for a single device, it looks as follows:
For device1:
t1,t1,t1,t1 (all in the same moment, then a gap of s seconds) t2,t2,t2,t2 (then a gap of s seconds) t3,t3,t3,t3
However, I want to remove these duplicate records that arrive in Kafka as bursts of events.
I want to consume them as follows:
t1,t2,t3,...
I was trying to use the windowing and KTable concepts that the Kafka Streams API provides, but it doesn't seem possible. Any ideas?
You might want to use Kafka's log compaction. But to use it, all the duplicated messages need to share the same key, while non-duplicate messages get different keys. Have a look at this:
https://kafka.apache.org/documentation/#compaction
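For reference, a compacted topic can also be created programmatically; a sketch using the AdminClient (topic name, partition count and replication factor are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Compaction keeps only the latest record per key, so duplicates
            // (which share a key) are eventually removed.
            NewTopic topic = new NewTopic("weather-deduped", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```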
Would it be an option to read the topic into a KTable, using t as the key? The duplicated values would then be treated as upserts rather than inserts, which effectively drops them. Then write the KTable back out into another topic.
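A minimal sketch of that idea (topic names are placeholders; it assumes the duplicates already share a key, e.g. deviceId plus t):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class KTableDedup {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Reading the bursty topic as a table turns same-key records into upserts,
        // so only the latest value per key survives.
        KTable<String, String> latest = builder.table("weather-events");
        // Write the (effectively deduplicated) changelog back out to another topic.
        latest.toStream().to("weather-deduped");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-dedup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```

With record caching enabled (the default), many of the burst updates for the same key are coalesced before they reach the output topic.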
Step 1:
Produce all messages that are logically duplicates with the same key.
Step 2:
If you don't need near-real-time processing with this topic as an input, use cleanup.policy=compact. It will give you "eventual" deduplication (which may be delayed for a long time).
Otherwise, use exactly-once Kafka Streams deduplication; there are DSL and Transformer examples of this.
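A rough sketch of the Transformer-style deduplication (store and topic names are placeholders; it uses the older transform()/Transformer API, which newer Kafka Streams releases deprecate in favor of process(), and a production version would also expire old keys, e.g. with a windowed store, to keep the state bounded):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // State store remembering which keys have already been forwarded.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seen-keys"),
                Serdes.String(), Serdes.Long()));

        builder.<String, String>stream("weather-events")
            .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                private KeyValueStore<String, Long> seen;

                @Override
                @SuppressWarnings("unchecked")
                public void init(ProcessorContext context) {
                    seen = (KeyValueStore<String, Long>) context.getStateStore("seen-keys");
                }

                @Override
                public KeyValue<String, String> transform(String key, String value) {
                    if (seen.get(key) != null) {
                        return null;                       // duplicate: drop it
                    }
                    seen.put(key, System.currentTimeMillis());
                    return KeyValue.pair(key, value);      // first occurrence: forward it
                }

                @Override
                public void close() { }
            }, "seen-keys")
            .to("weather-deduped");

        return builder;
    }
}
```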
Maybe this is a beginner question, but what is the recommended way to read data produced by KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data from a Spring application (e.g., to fan out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use case could be to do stream processing and write the results to a Redis store (e.g., for a web service which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access the results from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. That table will always hold the latest result, and you can query it from web services.
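For example, a bare-bones consumer that treats the result topic as a changelog and keeps only the latest value per key (the topic name is a placeholder; in a real service each upsert would go to Redis/Cassandra or out over a websocket instead of into a map):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TableTopicReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ksql-table-reader");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, String> latestByKey = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("MY_KSQL_TABLE"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // Each record is an upsert for its key; keeping only the newest
                    // value reproduces the table's current state.
                    latestByKey.put(rec.key(), rec.value());
                }
            }
        }
    }
}
```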
Check out our clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis
Is there an elegant way to query a Kafka topic for a specific record? The REST API that I'm building gets an ID and needs to look up records associated with that ID in a Kafka topic. One approach is to check every record in the topic via a custom consumer and look for a match, but I'd like to avoid the overhead of reading a bunch of records. Does Kafka have a fast, built in filtering capability?
The only fast way to search for a record in Kafka (to oversimplify) is by partition and offset. The new producer class can return, via futures, the partition and offset into which a message was written. You can use these two values to very quickly retrieve the message.
So if you make the ID out of the partition and offset, then you can implement your fast query. Otherwise, not so much. This means that the ID for an object isn't part of your data model but is instead generated by the Kafka-aware code.
Maybe that works for you, maybe it doesn't.
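A sketch of what that looks like in code (the topic name is illustrative): the producer's RecordMetadata supplies the partition and offset for the ID, and the lookup seeks straight to that offset.

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetIdLookup {
    // On write: the returned metadata yields a "partition:offset" ID for the record.
    static String store(KafkaProducer<String, String> producer, String value) throws Exception {
        RecordMetadata meta = producer.send(new ProducerRecord<>("records", value)).get();
        return meta.partition() + ":" + meta.offset();
    }

    // On read: decode the ID, seek directly to that offset, and fetch the record.
    static ConsumerRecord<String, String> fetch(KafkaConsumer<String, String> consumer, String id) {
        String[] parts = id.split(":");
        TopicPartition tp = new TopicPartition("records", Integer.parseInt(parts[0]));
        consumer.assign(List.of(tp));
        consumer.seek(tp, Long.parseLong(parts[1]));
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
            return rec;  // the first record returned is the one at the requested offset
        }
        return null;
    }
}
```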
This might be late for you, but it may help others who see this question: there is now KSQL, Kafka SQL, an open-source streaming SQL engine.
https://github.com/confluentinc/ksql/