How to get unmatched records from a stream to stream join in ksqldb? - apache-kafka

I'm just starting out with ksqlDB and can't figure out the right flow model for what I'm trying to achieve.
I have a Kafka topic that receives several types of events in JSON format. My requirement is to find pairs made up of two event types in the topic, join each pair into a single message and write it back to Kafka. Both events of a pair have the same value in the key field of the Kafka message.
The event types are 'meta' and 'data'.
This is pretty straightforward with an inner join. I created a source stream on top of the source topic and then built two streams on top of that stream, each selecting one of the event types. Afterwards, I created an inner-join CSAS on top of those two streams, which writes the paired output message to a Kafka topic. A match has to be found within a 1 hour window.
The problem is that I also have to find 'orphaned' (unmatched) records within this same time window and produce a message to Kafka for each of them (it will have 'error' in its event type field). The message should carry the data from the orphaned event, plus a custom header that I need to add.
So basically I'm talking about a full outer join from which I need only the records that didn't find their pair within the 1 hour window.
I'm using ksqldb version 0.18.
So what would be the right way to go about what I'm trying to achieve, without upgrading my ksqldb version?
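To make the expected result concrete, here is a minimal plain-Java sketch of the semantics described above. It is not ksqlDB syntax and not tied to any ksqlDB version: it only models "an event is an orphan if no event of the other type with the same key lies within one hour of it". The names `OrphanFinder`, `Event` and `toError` are hypothetical and used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of "pair meta and data by key within one hour, else report an orphan". */
class OrphanFinder {
    static final long WINDOW_MS = 60 * 60 * 1000L; // the one-hour join window

    /** Hypothetical event shape: message key, event type ("meta" or "data"), timestamp in ms. */
    static final class Event {
        final String key;
        final String type;
        final long ts;

        Event(String key, String type, long ts) {
            this.key = key;
            this.type = type;
            this.ts = ts;
        }
    }

    /** Returns the events that have no counterpart of the other type with the same key within WINDOW_MS. */
    static List<Event> orphans(List<Event> events) {
        List<Event> result = new ArrayList<>();
        for (Event e : events) {
            boolean matched = false;
            for (Event other : events) {
                if (other.key.equals(e.key)
                        && !other.type.equals(e.type)
                        && Math.abs(other.ts - e.ts) <= WINDOW_MS) {
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                result.add(e);
            }
        }
        return result;
    }

    /** Formats an orphan as a stand-in for the 'error' message from the question. */
    static String toError(Event e) {
        return "error:" + e.key + ":" + e.type;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("k1", "meta", 0L),
                new Event("k1", "data", 30 * 60 * 1000L), // within the window: pairs with k1 meta
                new Event("k2", "meta", 0L),
                new Event("k2", "data", 2 * WINDOW_MS),   // too late: both k2 events are orphans
                new Event("k3", "data", 5L));             // no meta at all
        for (Event e : orphans(events)) {
            System.out.println(toError(e));
        }
    }
}
```

Note that this works on a finished batch. On a live stream an event can only be declared an orphan once its window has closed, which is the part the outer join has to take care of.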

Related

Get output record partition within Kafka Streams

I have a KStream which branches and writes output records into different topics based on some internal logic. Is there any way I can know the partition of the output message from inside the KStream? The output topics have different number of partitions from the input ones.
When using the high-level DSL, you don't have access to the record metadata (which includes the topic, partition and offset the record came from). So you won't be able to achieve the goal with the plain KStream operations.
You could use the low-level Processor API if you wanted, which allows access to the metadata. Otherwise, you can add an implementation of ConsumerInterceptor and attach the partition value to the message before it reaches the consumer.

How can I aggregate Kafka topics?

I need to carry market data from a source to a target. I'd like to put each symbol, e.g. BTCUSD, in its own topic and have the target app subscribe to as many topics as it wants and receive the data of multiple symbols in correct time-based order.
I am currently putting all data into a single topic and have the target filter out the data it's not interested in.
Can I achieve what I want with Kafka alone or with an additional project, or can you name another message broker for the job?
Thanks.
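Kafka only guarantees ordering within a single partition, so a consumer reading several per-symbol topics has to restore the time order itself. A minimal plain-Java sketch of that step, assuming each per-symbol feed is already ordered by timestamp, is a k-way merge with a priority queue. The names `TimeOrderedMerge` and `Tick` are hypothetical, and no Kafka client code is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Merges several feeds, each sorted by timestamp, into one list sorted by timestamp. */
class TimeOrderedMerge {
    /** Hypothetical market-data record. */
    static final class Tick {
        final String symbol;
        final long ts;
        final double price;

        Tick(String symbol, long ts, double price) {
            this.symbol = symbol;
            this.ts = ts;
            this.price = price;
        }
    }

    /** Cursor over one feed, holding the next tick that has not been delivered yet. */
    private static final class Cursor {
        final Iterator<Tick> rest;
        Tick head;

        Cursor(Iterator<Tick> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    static List<Tick> merge(List<List<Tick>> feeds) {
        // The queue always exposes the feed whose next tick has the smallest timestamp.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.head.ts, b.head.ts));
        for (List<Tick> feed : feeds) {
            if (!feed.isEmpty()) {
                queue.add(new Cursor(feed.iterator()));
            }
        }
        List<Tick> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                queue.add(c); // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tick> btc = Arrays.asList(new Tick("BTCUSD", 1L, 100.0), new Tick("BTCUSD", 4L, 101.0));
        List<Tick> eth = Arrays.asList(new Tick("ETHUSD", 2L, 10.0), new Tick("ETHUSD", 3L, 11.0));
        for (Tick t : merge(Arrays.asList(btc, eth))) {
            System.out.println(t.ts + " " + t.symbol + " " + t.price);
        }
    }
}
```

On live feeds the merge can only emit a tick once every subscribed feed has a pending tick (or a timeout has passed); otherwise an older tick from a slow feed could still arrive afterwards.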

Reset Kafka streams applications via code / api

I'm wondering what would be the best method to perform this kind of operation with Kafka Streams.
I have one KStream and one GlobalKTable, let's say products (1,000,000 messages) and categoriesLogicBlobTable (10 messages).
Every time a new message arrives on the topic categoriesLogicBlobTable, I need to reprocess all the products, applying the newly arrived message to them, and the output goes to a third topic.
I was thinking of using the kafka.tools.StreamsResetter logic and hooking it into my code so that I stop the KafkaStreams instance, run the reset and start the stream again.
A second alternative is to not use Kafka Streams at all, but only two consumers and one producer. This way I could use the method consumer.seekToBeginning(Collections.emptyList());
Resetting a KafkaStreams application would result in a lot of duplicate output in this case. Assume you have 10 records in the stream and 5 records in the table, and while processing you produce 3 output records. Now you add a 6th record to the table and re-read the full stream. You would then re-emit the first 3 output records to the output topic, and maybe additional output records if some stream records also join to the newly added 6th table record. This does not seem like what you want.
I guess you need to use KafkaConsumer/KafkaProducer manually.

Kafka KStream Related Message Events in Sliding Window

We have a situation in which I think Kafka Streams could help, but I cannot find any documentation or examples that show how.
There is one similar question I found, but it does not have any implementation advice: Kafka Streams wait function with depending objects
What I want to do:
I would like to correlate related records from a Kafka topic into a single object and publish that new object to a separate output topic. For example, there might be five message records that are related to each other by a unique key - I want to build a new object from those related objects, and produce it to a new topic.
I want all related events within a sliding window of one hour to be aggregated. In other words, as soon as a message A with ID “123” arrives at the consumer, the application must wait at least one hour for the remaining records with ID “123” to arrive. After all records have arrived or one hour has passed, these records are expired.
Finally, all related messages collected over the hour are used to create a new object, which is then sent to another Kafka topic.
Problems I have encountered.
The sliding window in Kafka seems only to work when joining two streams together. We will only have one stream connected to the topic - I do not know why there are two streams required or how we would go about implementing this. I cannot find any examples of this online.
All of the stream functions I’ve seen in Kafka simply aggregate / reduce to a simple value when collecting events of the same key, for example the number of times a key appears or the sum of some value.
Here is some pseudo-code to describe what I am talking about. The function names/semantics are going to be different if the functionality exists.
KStream<Key, Object> kstream = kStreamBuilder.stream(TOPIC);
kstream.windowedBy(
        // One hour sliding window
    )
    .collectAllRelatedKeys(
        // Collect all records related to each key
        // map == HashMap<Key, ArrayList<Value>>
        map.get(key).add(value);
    )
    .transformAndProcess(
        if (ALL_EVENTS_COLLECTED) {
            // Create new object from all related records
            newObject = createNewObjectFromRelatedRecordsFunction(map.get(key));
            producer.send(newObject);
        }
    )
Questions (And Thank you For Helping):
How could I use sliding windows with a single stream?
How do I customize KStream/KTable functions to collect all related events within the time window and produce the new object to another topic?
How does acknowledging / offset management work with sliding window streams?
Could this guarantee Exactly Once delivery? For reference: https://www.confluent.io/blog/enabling-exactly-kafka-streams/
Sliding window support for aggregation was added in Apache Kafka 2.7.
Cf https://issues.apache.org/jira/browse/KAFKA-5636
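For the "collect everything for one key instead of reducing to a single value" part, the aggregate can simply be a list. Below is a minimal plain-Java sketch of the behaviour described in the question: a group starts with the first record of a key and accepts records with that key for one hour. It is a model only. It does not use the Kafka Streams API, and its "window starts at the first record" rule is an assumption taken from the question, not Kafka's sliding-window semantics. The names `RelatedRecordCollector` and `Rec` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Collects all values per key that arrive within one hour of the first record of that key. */
class RelatedRecordCollector {
    static final long WINDOW_MS = 60 * 60 * 1000L;

    /** Hypothetical record shape: correlation key, payload, timestamp in ms. */
    static final class Rec {
        final String key;
        final String value;
        final long ts;

        Rec(String key, String value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }
    }

    /** Input must be ordered by timestamp. Returns one list of values per emitted group. */
    static List<List<String>> collect(List<Rec> records) {
        Map<String, Long> windowStart = new LinkedHashMap<>();
        Map<String, List<String>> open = new LinkedHashMap<>();
        List<List<String>> emitted = new ArrayList<>();
        for (Rec r : records) {
            Long start = windowStart.get(r.key);
            if (start != null && r.ts - start > WINDOW_MS) {
                emitted.add(open.remove(r.key)); // window expired: emit the collected group
                start = null;
            }
            if (start == null) {
                windowStart.put(r.key, r.ts);    // this record opens a new group
                open.put(r.key, new ArrayList<String>());
            }
            open.get(r.key).add(r.value);
        }
        emitted.addAll(open.values());           // flush groups still open at end of input
        return emitted;
    }

    public static void main(String[] args) {
        long minute = 60 * 1000L;
        List<Rec> records = Arrays.asList(
                new Rec("123", "A", 0L),
                new Rec("456", "X", 10 * minute),
                new Rec("123", "B", 20 * minute),
                new Rec("123", "C", 90 * minute)); // more than one hour after "A": new group
        System.out.println(collect(records));
    }
}
```

In Kafka Streams itself the same idea would be an aggregation whose aggregate type is a list, applied to a windowed, grouped stream. That needs a serde for the list type, which is left out here.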

How to merge multiple kafka streams in order to do a session windowing over all events of the resulting stream

We have multiple input topics with different business events (page views, clicks, scroll events etc). As far as I understood Kafka streams they all get an event timestamp, which can be used for KStream joins with other streams or tables to align the times.
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
This should be possible by using groupByKey and then aggregate/reduce (specifying the inactivity time here) on a stream containing all events. This combined stream must have all events from the different input topics in order of event time (or in a way that the above Kafka Streams methods honor these event times).
The only challenge left is to create this combined / merged stream.
When I look at the Kafka Streams API, there is the KStreamBuilder#merge operation, for which the javadoc says: There is no ordering guarantee for records from different {@link KStream}s. Does this mean the session windowing will produce incorrect results?
If yes, what is the alternative to #merge?
I was also thinking about joining, but in fact it seems to depend on whether you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. For the first case, joining is a good strategy, but not for the latter, as you would get some unnecessary duplication.
stream A: <a,1> <a,2>
stream B: <a,3>
join-output plus session: <a,1-3 + 2-3>
Number 3 would be a duplicate.
Also keep in mind that joining slightly modifies the timestamps, and thus your session windows might be different if you apply them to the join result rather than to the raw data.
About merge() and ordering: you can use merge() safely, as the session windows will be built based on record timestamps and not on offset order. And all window operations in Kafka Streams can handle out-of-order data gracefully.
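To illustrate why arrival order does not matter here, the following plain-Java sketch computes session windows from timestamps alone. It is a model of the idea, not the Kafka Streams implementation (which merges sessions incrementally as records arrive rather than sorting); `SessionSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Derives session windows (start, end) for one key from event timestamps and an inactivity gap. */
class SessionSketch {
    /** Returns sessions as {start, end} pairs; the arrival order of the timestamps is irrelevant. */
    static List<long[]> sessions(List<Long> timestamps, long gapMs) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted); // sessions depend on event time only, not on arrival order
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            if (!result.isEmpty() && ts - result.get(result.size() - 1)[1] <= gapMs) {
                result.get(result.size() - 1)[1] = ts; // extend the current session
            } else {
                result.add(new long[] {ts, ts});       // gap exceeded: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Events of one user from two merged topics, interleaved in arbitrary arrival order.
        List<Long> pageViews = Arrays.asList(100L, 5000L);
        List<Long> clicks = Arrays.asList(200L, 5100L, 150L);
        List<Long> merged = new ArrayList<>(pageViews);
        merged.addAll(clicks);
        for (long[] s : sessions(merged, 1000L)) {
            System.out.println("session " + s[0] + " .. " + s[1]);
        }
    }
}
```

Feeding the same timestamps in any other order yields the same sessions, which is the property the merged stream relies on.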
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
From what I understand, you'd need to join the streams (and use groupBy to ensure that they can be properly joined by user id), not merge them. You can then follow up with a session-windowed aggregation.