Kafka Streams as table: PATCH log, not full POST - apache-kafka

Desired functionality: For a given key, key123, numerous services run in parallel and report their results to a single location; once all results have been gathered for key123, they are passed to a new downstream consumer.
Original idea: Use AWS DynamoDB to hold all results for a given entry. Every time a result is ready, a micro-service does a PATCH operation against key123 in the database. An output stream checks each UPDATE to see whether the entry is complete; if so, it is forwarded downstream.
New idea: Use Kafka Streams and KSQL to reach the same goal. All services write their output to a results topic, which forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream, i.e. having the output stream show the accumulation of all messages for key123 instead of just the most recent one.
KSQL users, does this even make sense? Am I close to a solution that someone has done before?

If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
  KEY INT KEY,           -- example key to group by
  EVENT STRING           -- example event to collect
) WITH (
  kafka_topic='source',  -- or whatever your source topic is called.
  value_format='json'    -- or whatever value format you need.
);
CREATE TABLE agg AS
  SELECT
    key,
    COLLECT_LIST(event) AS events
  FROM source
  GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
  KEY INT KEY,
  EVENTS ARRAY<STRING>
) WITH (
  kafka_topic='AGG',
  value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE IS_COMPLETE(EVENTS);
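Such a function would be an ordinary Java UDF deployed to ksqlDB. A minimal sketch, assuming the "complete" criterion is simply five non-null events (the class name, function name and the five-element check are illustrative assumptions, not anything from the question):

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import io.confluent.ksql.function.udf.UdfParameter;
import java.util.List;
import java.util.Objects;

@UdfDescription(name = "is_complete", description = "True once all expected results have arrived")
public class IsCompleteUdf {

  @Udf(description = "Checks whether the collected events form a complete result set")
  public boolean isComplete(@UdfParameter("events") final List<String> events) {
    // Illustrative criterion only: five events, none of them null.
    return events != null
        && events.size() == 5
        && events.stream().allMatch(Objects::nonNull);
  }
}

The compiled jar would go into the ksqlDB extension directory, after which IS_COMPLETE can be used in queries like the one above.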

Related

How to ensure that in one Kafka topic same key goes to same partition for multiple tables

I have a requirement to produce data from multiple MongoDB tables and push it to the same Kafka topic using the mongo-kafka connector. I also have to ensure that data for the same table key column value always goes to the same partition, to preserve message ordering.
For example:
tables --> customer, address
table key columns --> CustomerID (for table customer), AddressID (for table address)
For CustomerID = 12345, it will always go to partition 1
For AddressID = 54321, it will always go to partition 2
For a single table, the second requirement is easy to achieve using chained transformations. However, for multiple tables going into one topic I am finding it difficult, since each of these tables has a different key column name.
Is there any way to fulfil both requirements using the Kafka connector?
If you use the ExtractField$Key transform and the IntegerConverter, all matching IDs should go to the same partition.
If you have two columns and one table, or end up with keys like {"CustomerID": 12345}, then you have a composite/object key, meaning the whole key, not the ID itself, will be hashed when computing the partition.
You cannot choose the partition based on specific fields of a record without setting producer.override.partitioner.class in the connector config. In other words, you need to implement a partitioner that deserializes your data, parses the values, and then computes and returns the respective partition.
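For illustration, such a partitioner could extract the numeric ID out of the serialized key and hash only that value, so the same ID always lands on the same partition regardless of which table it came from. A rough sketch, assuming the key converter produces JSON like {"CustomerID":12345} or {"AddressID":54321} (the class name and the crude digit extraction are assumptions, not production-grade parsing):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class IdValuePartitioner implements Partitioner {

  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    if (keyBytes == null) {
      return 0; // no key: fall back to a fixed partition
    }
    // Crude extraction: keep only the digits of the ID from the JSON key.
    String id = new String(keyBytes, StandardCharsets.UTF_8).replaceAll("[^0-9]", "");
    // Hash the ID value itself, mirroring the default murmur2-based partitioning.
    return Utils.toPositive(Utils.murmur2(id.getBytes(StandardCharsets.UTF_8))) % numPartitions;
  }

  @Override
  public void configure(Map<String, ?> configs) { }

  @Override
  public void close() { }
}

It would be wired in with producer.override.partitioner.class=com.example.IdValuePartitioner (the package name is made up) in the connector configuration.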

Kafka How to process non-duplicate and ordered messages

Can anyone please help? I have the below requirement.
Requirement: process non-duplicate, ordered chat messages and bundle them per ProgramUserId. Here is the process and the topics involved.
Data set up:
A ProgramUserId can have any number of messages, but each message is unique and has a composite key: MsgId + Action. So imagine data in Kafka like the following.
P2->M3+A1, P2->M2+A1, P2->M1+A1, P1->M3+A1, P1->M2+A2, P1->M2+A1, P1->M1+A1
I am doing this right now:
Initial-Topic: (Original key: ProgramUserId)
1) From Initial-Topic --> consume a KStream (re-keying to MsgId + Action) --> write to topic: dedup-Topic
2) From dedup-Topic --> consume a KStream (re-keying back to the original key, ProgramUserId) --> write to topic: Final-Topic
Since we are re-keying at the dedup-Topic, the message order gets messed up, because re-keying results in re-partitioning and hence there is no guarantee of ordering.
I added the logic below to achieve deduplication:
From dedup-Topic, create a KTable and a Postgres table (using a sink connector). For each incoming message, check the key (MsgId + Action) in both the KTable and the PG table.
If the record is not found, it is not a duplicate, so write it to dedup-Topic.
But with the above, the message order gets messed up due to the re-keying / re-partitioning into dedup-Topic.
How can I achieve ordered messages at this point? Please help.
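For reference, the two re-keying hops described above would look roughly like this in the Streams DSL. This is only a sketch of the current approach, not a solution: the topic names come from the question, while the String types and the extract*() helpers are hypothetical placeholders, and the dedup check itself is omitted.

import org.apache.kafka.streams.StreamsBuilder;

// Sketch of the existing pipeline: re-key to MsgId + Action, then re-key back.
StreamsBuilder builder = new StreamsBuilder();

// 1) Initial-Topic (key: ProgramUserId) -> re-key to MsgId + Action -> dedup-Topic
builder.<String, String>stream("Initial-Topic")
       .selectKey((programUserId, msg) -> extractMsgId(msg) + "+" + extractAction(msg))
       .to("dedup-Topic");

// 2) dedup-Topic -> re-key back to ProgramUserId -> Final-Topic
builder.<String, String>stream("dedup-Topic")
       .selectKey((msgIdAction, msg) -> extractProgramUserId(msg))
       .to("Final-Topic");

Each re-key triggers a repartition through a topic, which is exactly where the per-ProgramUserId ordering described in the question is lost.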

Write to an existing topic with ksqlDB

I have three Kafka topics: entity.created, entity.deleted, entity.attribute. They all have some entity ID as the message key and a Protobuf message value.
I'm trying to use ksqlDB to aggregate messages from the first two topics and produce messages to the third one:
-- TODO: this might be possible to do with just one table
CREATE TABLE apples_created AS
  SELECT references->basket_id, count(1) AS count
  FROM entity_created
  WHERE type = 'APPLE'
  GROUP BY references->basket_id
  EMIT CHANGES;

CREATE TABLE apples_deleted AS
  SELECT references->basket_id, count(1) AS count
  FROM entity_deleted
  WHERE type = 'APPLE'
  GROUP BY references->basket_id
  EMIT CHANGES;

CREATE TABLE apples_any AS
  SELECT
    apples_created.basket_id AS basket_id,
    apples_deleted.count IS NULL OR apples_created.count > apples_deleted.count AS has_apples
  FROM apples_created
  LEFT JOIN apples_deleted ON apples_created.apple_id = apples_deleted.apple_id
  EMIT CHANGES;
However, I have found no way to get the output of apples_any to my entity.attribute topic. For example:
CREATE TABLE attr_testing WITH (kafka_topic = 'entity.attribute') AS
  SELECT basket_id, 'BASKET' AS type, 'hasApples' AS name, CASE WHEN has_apples THEN 'true' ELSE '' END AS value
  FROM apples_any
  EMIT CHANGES;
fails, because it tries to register its own schema for the existing topic:
Could not register schema for topic: Schema being registered is incompatible with an earlier schema for subject "entity.attribute-value"; error code: 409
I will have multiple Kafka producers writing to entity.attribute and want ksqlDB to be just one of them. Is writing to an existing topic possible with ksqlDB? If not, I imagine I need a way to copy messages from apples_any (the underlying topic) to entity.attribute. What would be the preferred solution in the Kafka ecosystem?
I'm using ksqlDB 0.17.0.
Edit:
This is the value schema used by the entity.attribute topic:
message EntityAttribute {
  EntityType type = 2;
  string name = 3;
  string value = 4;
}

enum EntityType {
  APPLE = 0;
  BASKET = 1;
}
Edit 2:
The error in ksqlDB logs I get when I try to create that attr_testing table is:
WARN Found incompatible change: Difference{fullPath='#/EntityAttribute', type=MESSAGE_REMOVED} (io.confluent.kafka.schemaregistry.protobuf.ProtobufSchema)
If I understand correctly, ksqlDB is trying to register a completely new schema with a different message name. This leads me to believe that whatever schema I use, ksqlDB will ignore it. Also, I have no idea how I could tell ksqlDB to create an enum field (EntityType) instead of a string. That's why I feel that producing to an existing topic is not really a ksqlDB use case...
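For what it's worth, the "copy messages from apples_any to entity.attribute" fallback mentioned above could be as small as a topology that forwards records byte-for-byte from the table's backing topic. A minimal sketch, assuming the backing topic is named APPLES_ANY; note this only moves bytes and does not solve the real problem of producing values that match the existing EntityAttribute Protobuf schema:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

// Forward records unchanged from the table's backing topic (name assumed)
// to the existing entity.attribute topic.
StreamsBuilder builder = new StreamsBuilder();
builder.stream("APPLES_ANY", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
       .to("entity.attribute", Produced.with(Serdes.ByteArray(), Serdes.ByteArray()));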

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k,(v,f)) where f is the frequency of occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won't be able to join on a windowed key, instead of the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
         .windowedBy(TimeWindows.of(n * 1000)).count()
         .toStream()
         .map((k, v) -> new KeyValue<>(k.key(), Math.toIntExact(v)))
         .to(frequency_topic);
KTable<Integer, Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up in this table when a new record arrives on my stream, how do I know whether the lookup table will be updated first or the join will happen first (in which case the stale frequency would be added to the record rather than the current, updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this ? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic...
When the record is processed, it updates the aggregation result, which emits an update record that is written to the through topic. Directly afterwards, the record is processed by the join operator. Only later will a new poll() call eventually read the aggregation result from the through topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join and provides the semantics you need.
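A minimal sketch of that idea, assuming Integer keys and values as in the question, a key-value state store named "freq-store" attached to the transformer, and the Tuple pair type from the question. Expiring counts older than n seconds (e.g. via a window store or punctuation) is deliberately left out:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Update the per-key count and emit (value, frequency) in the same step, so
// the frequency can never lag behind the record that produced it.
public class FrequencyJoinTransformer
    implements Transformer<Integer, Integer, KeyValue<Integer, Tuple<Integer, Integer>>> {

  private KeyValueStore<Integer, Integer> store;

  @Override
  @SuppressWarnings("unchecked")
  public void init(final ProcessorContext context) {
    store = (KeyValueStore<Integer, Integer>) context.getStateStore("freq-store");
  }

  @Override
  public KeyValue<Integer, Tuple<Integer, Integer>> transform(final Integer key, final Integer value) {
    final Integer previous = store.get(key);
    final int frequency = (previous == null ? 0 : previous) + 1;
    store.put(key, frequency);                                 // update the count first ...
    return KeyValue.pair(key, new Tuple<>(value, frequency));  // ... then emit (v, f)
  }

  @Override
  public void close() { }
}

It would be attached with t1_Stream.transform(FrequencyJoinTransformer::new, "freq-store") after registering the "freq-store" store on the StreamsBuilder.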

How to count unique users in a fixed time window in a Kafka stream app?

We have a Kafka message in a single topic for each event a user performs on our platform. Each of these events / Kafka messages has a common field userId. We now want to know, from that topic, how many unique users we had in every hour. We are not interested in the event types or in individual counts per user; we just want to know how many unique users were active in each hour.
What is the easiest way to achieve this? My current idea does not seem very simple, see the pseudo code here:
stream
  .selectKey()   // select userId as the key
  .groupByKey()  // group by userId, results in a KGroupedStream[UserId, Value]
  .aggregate(    // initializer, merger and accumulator simply deliver a constant value; the message is now just a tick for that userId key
    TimeWindows.of(3600000)
  )              // result of aggregate is a KTable[Windowed[UserId], Const]
  .toStream      // convert to a stream to be able to map the key in the next step
  .map()         // map the key only (Windowed[UserId]) to key = startMs of the window and value = UserId
  .groupByKey()  // group by startMs of the window, which was selected as the key before
  .count()       // results in a KTable from startMs of the window to the count of users (== unique userIds)
Is there an easier way? I probably overlook something.
There are two things you can do:
Merge selectKey() and groupByKey() into groupBy()
You don't need the toStream().map() step; you can do the regrouping with a new key directly on the first KTable
Something like this:
stream.groupBy(/* KeyValueMapper that returns the grouping key */)
      .aggregate(... TimeWindows.of(TimeUnit.HOURS.toMillis(1)) ...)
      .groupBy(/* KeyValueMapper that returns the new grouping key */)
      .count()
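Putting both suggestions together, a minimal sketch could look like the following. The input topic name "events" and the extractUserId() helper are assumptions; the first count() just marks a user as active in a window, and the second count() on the re-grouped KTable yields the number of distinct users per window start:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> events = builder.stream("events");

// One entry per (userId, hour window); the Long value is irrelevant, it only
// marks the user as active in that window.
KTable<Windowed<String>, Long> activePerUser = events
    .groupBy((key, value) -> extractUserId(value),
             Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofHours(1)))
    .count();

// Re-group by window start and count the users per window (== unique userIds).
KTable<Long, Long> uniqueUsersPerHour = activePerUser
    .groupBy((windowedUserId, count) ->
                 KeyValue.pair(windowedUserId.window().start(), 1L),
             Grouped.with(Serdes.Long(), Serdes.Long()))
    .count();

Because the KTable groupBy handles retractions, repeated events from the same user within a window do not change the per-window count, which is what makes it a unique-user count.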