How can I simulate event lateness in Apache Beam reading from a Kafka Source

I am trying to tune the windowing parameters in my streaming Beam pipeline. The parameters I am modifying are withAllowedLateness, triggers, the window interval, pane firing, and so on.
However, I don't know how to produce late events in my Kafka-consuming pipeline to test these changes. Can anybody suggest how to create event lateness?
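For reference, the kind of windowing configuration I am tuning looks roughly like this (window size, lateness and trigger values are illustrative placeholders, not my real settings, and `events` stands for the PCollection read from Kafka):

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// events: the PCollection produced by KafkaIO.read(), keyed by some id
PCollection<KV<String, String>> windowed = events.apply(
    Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(5)))
        // fire at the watermark, then once more for every late element
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // accept elements up to 30 minutes behind the watermark
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes());
```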
Thanks

Do you use the Kafka publish time as the window time, or a custom field?
Most of the time we window on a custom date field (which in most cases makes more sense, since you want to group on some logical time; the publisher may have issues and publish messages with some delay). With that setup it is very easy to simulate "late data": just send events whose custom date field contains a timestamp in the past, as in the sketch below.
Do you consume the messages in order? If so, you can also keep publishing data to your Kafka topic without reading it at all, and only start the Beam job once you have a huge backlog. When there is a backlog, messages are usually not read in order, which causes more data to arrive after its window has closed, i.e. late data.
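A minimal sketch of the first approach, assuming the events are JSON values with an eventTime field that the pipeline windows on (topic name, field name and broker address are illustrative):

```java
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LateEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Back-date the logical event time by 45 minutes so the record falls
            // behind the watermark once the pipeline has already seen newer events.
            Instant lateEventTime = Instant.now().minusSeconds(45 * 60);
            String value = String.format(
                "{\"userId\":\"u-1\",\"eventTime\":\"%s\"}", lateEventTime);
            producer.send(new ProducerRecord<>("clicks", "u-1", value));
            producer.flush();
        }
    }
}
```

If the pipeline windows on that eventTime field and the allowed lateness is smaller than 45 minutes, the record shows up in a late pane or gets dropped, which is exactly the behaviour you want to observe while tuning.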

Related

Is there a way to tell which event occurred first in two kafka topics

If I have two topics in Kafka, is there a way to tell if an event in one topic "occurred" before an event in another topic if they both come in within a millisecond of each other, i.e. they have the same timestamp?
Background:
I am building an event-sourcing based, event-driven architecture. Often, when an event occurs in one topic, I need to do a scan to find out whether a separate event has already occurred in a second topic. Likewise, if the event in the second topic comes in, I need to scan to see if the event in topic one has occurred.
In order to not duplicate processing, I need a deterministic way to order the events. If the events are more than 1 millisecond apart, I can just use the timestamp in the event. But because Kafka timestamps only go down to the millisecond, when two events occur close together I can no longer use this approach.
In reality, I don't care which topic "occurred" first, i.e. if Kafka posted one before another, even if they came in a different order, I don't care. I just need a deterministic way to order them.
In practice, I could use some tiebreak, such as ordering the events by topic name alphabetically (see the sketch below), but I was hoping there was a built-in mechanism. (I don't want to introduce weird bugs because I always process event A before event B; unlikely, but I've seen it happen.)
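For illustration, the deterministic ordering I have in mind is roughly this (the record types are just an example):

```java
import java.util.Comparator;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Order by the record timestamp first; when two records share the same
// millisecond, fall back to the (arbitrary but deterministic) topic name.
Comparator<ConsumerRecord<String, String>> deterministicOrder =
    Comparator.<ConsumerRecord<String, String>>comparingLong(ConsumerRecord::timestamp)
        .thenComparing(ConsumerRecord::topic);
```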
PS: I am open to other ideas. I'm considering this approach because it was possible in Redis Streams. However, because of things I can't control, I am restricted to Kafka. I also want to avoid using an external data store, since then I would need to start worrying about data synchronization in it.
You're going to run into synchronization issues regardless. For example, you could try a stream-table join in Kafka Streams: if the matching event doesn't exist at join time, it hasn't happened yet, but then you're reliant on having absolutely zero lag in the consumer processes building that KTable.
You could try storing nanoseconds as part of the value or a header when you create the record if you need higher precision, but again, you will either need absolutely zero lag or very precise consumer poll intervals with some comparison window, because Kafka does not provide any processing guarantees across multiple topics.
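A rough sketch of the header idea, assuming you control the producers of both topics (the header name is illustrative, and the nanosecond field is only as precise as the producer's clock):

```java
import java.nio.ByteBuffer;
import java.time.Instant;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

// Producer side: attach an epoch-nanosecond timestamp as a record header.
Instant now = Instant.now();
long epochNanos = now.getEpochSecond() * 1_000_000_000L + now.getNano();
ProducerRecord<String, String> record = new ProducerRecord<>("topic-a", "key", "value");
record.headers().add("ts-nanos",
    ByteBuffer.allocate(Long.BYTES).putLong(epochNanos).array());

// Consumer side: read it back when comparing events from the two topics
// (rec is the ConsumerRecord being processed).
Header h = rec.headers().lastHeader("ts-nanos");
long eventNanos = (h == null) ? -1L : ByteBuffer.wrap(h.value()).getLong();
```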

Kafka: Adding batches of old data

We use Kafka for time-based events, with windowing to group events into sessions in Kafka Streams.
How should we handle the arrival of a set of data from a different source that consists of old data?
Say for example, you are doing web analytics for a client.
You receive event data from a client in an event topic, where you receive all event types.
For some reason you did not receive order (purchase) data from a client; you only received pageview data, from which you build sessions.
Some time later, you receive a batch of time-based orders, say from the last year, so you can match them with the sessions (associate which sessions are related to which order).
Output of this process is sent to "orders", "pageviews", "sessions", etc. topics.
However, if you just append them to the (end of the) topic, they will be "unordered", so even if you recalculate the data, your results will not be correct.
This is somewhat similar to the out-of-order events problem in streaming, but allowing a much longer delay (e.g. a year).
One possibility would be to "delete and rewrite": on a topic with compaction, delete all data for that client, resend it again in order, and then launch a recalculation for that client (roughly as sketched below).
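For the delete step, assuming compaction is enabled and you know which keys belong to the client, this amounts to publishing tombstones (null values) for those keys; once compaction runs they disappear, and the client's events can then be re-published in order (topic and key names here are just placeholders):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // A null value is a tombstone: compaction removes the key once it runs.
    List<String> clientKeys = List.of("client42-session-1", "client42-session-2");
    for (String key : clientKeys) {
        producer.send(new ProducerRecord<>("sessions", key, null));
    }
    producer.flush();
}
```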
But that's quite cumbersome.
Is there a better way to handle this?

Classical Architecture to Kafka, how do you realize the following?

We are trying to move away from our classical architecture (J2EE application server / relational database) towards Kafka. I have a use case that I am not sure how exactly to proceed with...
Our application exports data from the relational database with a scheduler. In the future, we plan not to put the information into the relational database at all, but to export it directly from the information in the Kafka topic(s).
What I am not sure about is which would be the better solution: to configure a consumer that polls the topic(s) on the same schedule as the scheduler and exports from there,
or to create a Kafka Streams application at the schedule trigger point to collect this information from the stream?
What do you think?
The approach you want to adopt is technically feasible. A few possible solutions:
1) A continuously running Kafka consumer with duration = <export schedule time>.
2) A cron-triggered Kafka Streams consumer with a batch duration equal to the schedule, committing offsets back to Kafka.
3) A cron-triggered Kafka consumer that handles offsets programmatically and pulls records based on those offsets according to your schedule (see the sketch below).
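A rough sketch of option 3, assuming each cron run drains whatever has accumulated since the last run, exports it, and only then commits offsets (topic, group id and the export step are placeholders; exportRecord is a hypothetical method standing in for your export logic):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "scheduled-export");   // keep this group.id unique to the export job
props.put("enable.auto.commit", "false");    // commit only after a successful export
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("export-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        if (records.isEmpty()) {
            break;                            // backlog drained; stop until the next cron run
        }
        for (ConsumerRecord<String, String> record : records) {
            exportRecord(record);             // hypothetical export step
        }
        consumer.commitSync();                // advance offsets only after the batch is exported
    }
}
```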
Important considerations:
Increase retention.ms to much more than your scheduled batch job interval (a sketch follows below).
Increase disk space to accommodate the data volume spike, since you are going to hold data for a longer duration.
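For example, with the Kafka Admin client (2.3 or newer; topic name and retention value are placeholders) the retention change could look roughly like this:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "export-topic");
    // 7 days in milliseconds; pick something comfortably larger than the export interval
    AlterConfigOp setRetention = new AlterConfigOp(
        new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
    admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
}
```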
Risks & Issues:
Data could age out of retention over a weekend or any longer gap between runs.
If another application uses the same group.id by mistake, it can corrupt the committed offsets.
No aggregation/math function can be applied before retrieval.
Your application cannot filter/extract records based on any parameter.
Unless offsets are managed externally, the application cannot re-read records.
Records will not be formatted, i.e. they will mostly be JSON strings or possibly some other format.

Kafka Streams Sort Within Processing Time Window

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say the topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's effectively random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream and publish each message to another topic, partitioning the records by their user id (repartitioning the original stream by user id).
Then we consume this repartitioned stream and store the consumed records in a local state store, windowed by 10 minutes. All clicks of a particular user are always in the same partition, but their order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time is advanced when new records come in, but I need this window to use processing time, not event time, and then, when the window expires, I need to be able to sort the buffered events and emit them in that order to another topic.
Notice:
We need to be able to flush/process the records within the window using processing time, not event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If the application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not yet processed, without waiting for new records to arrive for a particular user.
I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what the right way to implement what I need would be (especially taking the recovery requirement described above into account).
You can see my answer to a similar question here:
https://stackoverflow.com/a/44345374/7897191
Since your message keys are already unique, you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0, you should be able to implement the outlined algorithm without issues. It uses the Processor API; I don't know of a way to do this with only the DSL.
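A rough sketch of what that could look like with the 1.0.x Processor API, using a persistent key-value store so the buffer survives a crash and is restored from its changelog topic (the store name, the Click type, and the way the sort key is derived are all illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Click is an illustrative value type assumed to expose a public long eventTime field.
public class SortAndFlushProcessor implements Processor<String, Click> {

    private ProcessorContext context;
    private KeyValueStore<String, Click> buffer;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // "click-buffer" must be registered as a persistent store on the topology.
        this.buffer = (KeyValueStore<String, Click>) context.getStateStore("click-buffer");
        // Wall-clock punctuation (KIP-138): fires every 10 minutes of processing time,
        // whether or not new records arrive.
        context.schedule(10 * 60 * 1000L, PunctuationType.WALL_CLOCK_TIME, this::flushWindow);
    }

    @Override
    public void process(String userId, Click click) {
        // One entry per click: combine the user id with the record offset so
        // multiple clicks from the same user do not overwrite each other.
        buffer.put(userId + "#" + context.offset(), click);
    }

    private void flushWindow(long timestamp) {
        List<KeyValue<String, Click>> batch = new ArrayList<>();
        try (KeyValueIterator<String, Click> it = buffer.all()) {
            it.forEachRemaining(batch::add);
        }
        // Sort the buffered clicks by their embedded event time, then emit them in order.
        batch.sort((a, b) -> Long.compare(a.value.eventTime, b.value.eventTime));
        for (KeyValue<String, Click> kv : batch) {
            String userId = kv.key.substring(0, kv.key.indexOf('#'));
            context.forward(userId, kv.value);
            buffer.delete(kv.key);   // requirement: clear the store once the window is flushed
        }
        context.commit();
    }

    @Override
    public void punctuate(long timestamp) { }   // deprecated in 1.0, not used

    @Override
    public void close() { }
}
```

On restart, the store is rebuilt from its changelog and the next wall-clock punctuation flushes whatever was still buffered, which covers the recovery requirement without waiting for new records from a particular user.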

Delayed message consumption in Kafka

How can I produce/consume delayed messages with Apache Kafka? It seems that standard Kafka (and the Java kafka-client) doesn't have this feature. I know that I could implement it myself with a standard wait/notify mechanism, but it doesn't seem very reliable, so any advice and good practices are appreciated.
I found a related question, but it didn't help.
As I see it: Kafka is based on sequential reads from the file system and can only be used to read topics straight through while keeping message ordering. Am I right?
Indeed, Kafka's lowest-level structure is the partition, a sequence of events in a queue with incremental offsets; you can't insert a record anywhere other than at the end at the moment you produce it. There is no concept of delayed messages.
What do you want to achieve exactly?
Some possibilities in your case:
You want to push a message at a specific time (for example, a "start job" event). In this case, use a scheduled task (not from Kafka; use some standard way in your OS / language / custom app / whatever) to send the message at the given time; consumers will receive it at the proper time.
You want to send an event now, but it should not be taken into account by consumers until later. In this case, you can use a custom structure which includes a "time" field in its payload. Consumers will have to understand this field and have custom processing to deal with it, for example: "start job at 2017-12-27T20:00:00Z". You could also use headers for this, but headers are not supported by all clients for now (a rough sketch of this option follows below).
You can change the timestamp of the message you send. Internally it will still be read in order, but some functions involving time will work differently, and a consumer could use the timestamp of the message for its action. This is somewhat like the previous proposition, except the timestamp is metadata of the event rather than part of the event payload itself. I would not use this personally; I only touch the timestamp when I proxy events.
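A very simplified sketch of the second option, assuming a "process-at" header carrying epoch milliseconds (header and topic names are placeholders, and handle is a hypothetical processing step); a production version would pause partitions rather than sleep in the poll loop:

```java
import java.nio.ByteBuffer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "delayed-consumer");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("jobs"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
            Header h = record.headers().lastHeader("process-at");
            long processAt = (h == null) ? 0L : ByteBuffer.wrap(h.value()).getLong();
            long wait = processAt - System.currentTimeMillis();
            if (wait > 0) {
                // Naive: blocks the poll loop; a sleep longer than max.poll.interval.ms
                // would trigger a consumer group rebalance.
                Thread.sleep(wait);
            }
            handle(record);   // hypothetical processing step
        }
    }
}
```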
For your last question: basically yes, but with some notes:
Topics are actually split into partitions, and order is only preserved within a partition. All messages with the same key are sent to the same partition.
Most of the time you only read from memory, except if you read old events; in that case they are read sequentially from disk, which is still very fast.
You can choose where to begin reading (a given offset or a given time) and even change it at runtime (see the sketch after this list).
You can parallelize reads across processes: multiple consumers can read the same topics without ever reading the same message twice (each one reads different partitions; see consumer groups).
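For example, seeking every assigned partition to the first offset at or after a given wall-clock time could look roughly like this (topic and group id are placeholders):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "replay");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("events"));
    // Poll once so the group assigns partitions to this consumer
    // (in practice you may need to poll until assignment() is non-empty).
    consumer.poll(Duration.ofSeconds(1));

    long yesterday = Instant.now().minus(Duration.ofDays(1)).toEpochMilli();
    Map<TopicPartition, Long> query = new HashMap<>();
    for (TopicPartition tp : consumer.assignment()) {
        query.put(tp, yesterday);
    }
    // For each partition, find the earliest offset whose timestamp is >= "yesterday"
    for (Map.Entry<TopicPartition, OffsetAndTimestamp> e :
            consumer.offsetsForTimes(query).entrySet()) {
        if (e.getValue() != null) {
            consumer.seek(e.getKey(), e.getValue().offset());
        }
    }
    // ...continue with the normal poll loop from here...
}
```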