Initial load of Kafka stream data with windowed join - apache-kafka

I am using a Windowed Join between two streams, let's say a 7 day window.
On initial load, all records in the DB (via kafka connect source connector) are being loaded to the streams. It seems then that ALL records end up in the window state store for those first 7 days as the producer/ingested timestamps are all in current time vs. a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the Windows of the join?

Well, the question is what records do you want to join to each other? And what timestamp the source connector sets as record timestamp (might also depend on the topic configuration, [log.]message.timestamp.type.
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestampe extractor is the way to go.
If you want to get processing time semantics, you may want to use the WallclockTimestampExtractor though.

Related

How do you get the latest offset from a remote query to a Table in ksqlDB?

I have an architecture where I would like to query a ksqlDB Table from a Kafka stream A (created by ksqlDB). On startup, Service A will load in all the data from this table into a hashmap, and then afterward it will start consuming from Kafka Stream A and act off any events to update this hashmap. I want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the table, and when I started consuming off Kafka Stream A. Is there a way that I can retrieve the latest offset that my query to the table is populated by so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down so reading directly off the Kafka stream is not an option. Reading an entire stream worth of data every time our apps come up is not a scalable solution. Reading in the event streams data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option since we can get the latest state of data in the format needed and then just update based off of events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.
You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similar to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A.
We will need a guaranteed monotonic column with an integer ID
or a timestamp. Let's call it ts.
Query m = max(ts).
Do a big query of records < m, slowly filling your hashmap.
Start consuming Stream A.
Do a small query of records >= m, updating the hashmap.
Continue to loop through subsequently arriving Stream A records.
Now you're caught up, and can maintain the hashmap in sync with DB.
Your business logic probably requires that you
treat DB rows mentioning the "self" guid
in a different way from rows that existed
prior to startup.
Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful.
There's also listOffsets().

Is it possible to detect and drop duplicate data using ksql

I have a simple question whether can we detect and drop duplicates in streaming data on kafka topic using KSQL.
By default, tables are de-duped on keys. A new record for the same key will overwrite old events. If you need to "detect" and "process" the old data, as new events come in, then KSQL cannot do this.
If you need distinct values rather than by-key, you can create a table against some stream of events and filtering on HAVING COUNT(field) = 1 over a time window, which is the best you can do there. Ref - https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html
If you need indefinite time windows to ensure you only process a certain field once, then you'll want to use an external database, and optionally an internal cache, to perform lookups against. This would need to be done with a regular consumer, or Kafka Streams.

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to
multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. e.g. say there are three data sources pushing their respective events into Kafka topics, but due to some reason, the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how can we make sure that the data from all three sources are pushed in-order so that when an inference engine requests entries (say the last 100 entries) from multiple kakfa topics, the corresponding entries in each topic belong to the same time window?
I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define Windows in KSQL:
hopping windows, tumbling windows, and session windows. Hopping and
tumbling windows are time windows, because they're defined by fixed
durations they you specify. Session windows are dynamically sized
based on incoming data and defined by periods of activity separated by
gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to Kafka topics by using batch.size and linger.ms.
eg. if linger.ms is set to 1000, all messages would be sent to Kafka within 1 second.
Handle delay at the consumer end -
Considering any streaming engine at the consumer side (be it Kafka-stream, spark-stream, Flink), provides windows functionality to join/aggregate stream data based on keys while considering delayed window function.
Check this - Flink windows for reference how to choose right window type link
To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from data source with a batch Id (Guid) of some form. Consumers can then wait until the next batch id shows up marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently mis-aligned. There is no algorithm to detect this but you might have some fields in the data that show discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x-seconds and assume all sources succeed in this much time or look at some form of time stamps (logical or wall clock) to detect that a source has moved on to the next time window implicitly showing completion of the last window.
The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time. For eg: the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay. use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
At the inference engine side, expand your windows at a per producer level to synch up events across producers.

Need advice on storing time series data in aligned 10 minute batches per channel

I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The wanted end result is data packaged in 10 minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23: 50 < t <= 00:00). Each package is to contain only data of one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it to a table in a Cassandra database where the key is the timestamp and the channel name, and every time such an RDD hits a 10 minute boundary, this boundary is posted to another topic, alongside the channel whose boundary is hit.
The second job listens to this "boundary topic", and for every received 10 minute boundary, the data is pulled from Cassandra, some calculations like min, max, mean, stddev are done and the data and these results are packaged to a defined output directory. That way, each directory contains the data from one channel and one 10 minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
I work with the Flink Stream Processing, not Spark-streaming but I guess the programming concept of both of them is alike. So supposing data are ordered chronologically and you want to aggregate data for every 10 minutes and do some processing on aggregated data, I think the best approach is to use the Streaming Window Functions. I suggest to define a function to map every incoming data's timestamp to the last 10 minutes:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
That the Long field is the mapped timestamp of every message. Then you can apply a window. You should search what kind of window is more appropriate for your case.
Point: Setting a key for the data stream will cause the window function to consider a logical window for every key.
In the most simple case, you should define a time window of 10 minutes and aggregate all data incoming on that period of time.
The other approach, if you know the rate of generating of data and how many messages will be generated in a period of 10 minutes, is to use Count window. For example, a window with the count of 20 will listen to the stream and aggregate all the messages with the same key in a logical window and apply the window function just when the number of messages in the window reaches 20.
After messages aggregated in a window as desired, you could apply your processing logic using a reduce function or some action like that.

Oracle change-data-capture with Kafka best practices

I'm working on a project where we need to stream real-time updates from Oracle to a bunch of systems (Cassandra, Hadoop, real-time processing, etc). We are planing to use Golden Gate to capture the changes from Oracle, write them to Kafka, and then let different target systems read the event from Kafka.
There are quite a few design decisions that need to be made:
What data to write into Kafka on updates?
GoldenGate emits updates in a form of record ID, and updated field. These changes can be writing into Kafka in one of 3 ways:
Full rows: For every field change, emit the full row. This gives a full representation of the 'object', but probably requires making a query to get the full row.
Only updated fields: The easiest, but it's kind of a weird to work with as you never have a full representation of an object easily accessible. How would one write this to Hadoop?
Events: Probably the cleanest format ( and the best fit for Kafka), but it requires a lot of work to translate db field updates into events.
Where to perform data transformation and cleanup?
The schema in the Oracle DB is generated by a 3rd party CRM tool, and is hence not very easy to consume - there are weird field names, translation tables, etc. This data can be cleaned in one of (a) source system, (b) Kafka using stream processing, (c) each target system.
How to ensure in-order processing for parallel consumers?
Kafka allows each consumer to read a different partition, where each partition is guaranteed to be in order. Topics and partitions need to be picked in a way that guarantees that messages in each partition are completely independent. If we pick a topic per table, and hash record to partitions based on record_id, this should work most of the time. However what happens when a new child object is added? We need to make sure it gets processed before the parent uses it's foreign_id
One solution I have implemented is to publish only the record id into Kafka and in the Consumer, use a lookup to the origin DB to get the complete record. I would think that in a scenario like the one described in the question, you may want to use the CRM tool API to lookup that particular record and not reverse engineer the record lookup in your code.
How did you end up implementing the solution ?