Kafka: How to process non-duplicate and ordered messages - apache-kafka

Can anyone please help? I have the requirement below.
Requirement: Process non-duplicate, ordered chat messages and bundle them per ProgramUserId. Here are the process and the topics involved.
Data set up:
A ProgramUserId can have any number of messages, but each message is unique and has a composite key: MsgId + Action. So imagine data in Kafka like this:
P2->M3+A1, P2->M2+A1, P2->M1+A1, P1->M3+A1, P1->M2+A2, P1->M2+A1, P1->M1+A1
I am doing this right now:
Initial-Topic: (Original key: ProgramUserId)
1) From Initial-Topic --> consume a KStream (re-keying to MsgId + Action) --> then write to topic: dedup-Topic
2) From dedup-Topic --> consume a KStream (re-keying back to the original key, ProgramUserId) --> write to topic: Final-Topic
Since we are re-keying at the dedup-Topic, the message order gets messed up, because re-keying causes re-partitioning and hence there is no guarantee of ordering.
I added the logic below to achieve deduplication:
From the dedup-Topic, create a KTable and a Postgres table (using a sink connector). For each incoming message, check the key (MsgId + Action) in both the KTable and the PG table.
If the record is not found, it is not a duplicate, so write it to the dedup-Topic.
But with the above, the message order is getting messed up due to the re-keying / re-partitioning at the dedup-Topic.
Please help: how can I achieve ordered messages at this point?
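For reference, a minimal sketch of the two-step re-keying topology described above. The topic names follow the question; the String serdes and the key-extraction helpers are placeholders, since the actual message format is not specified, and the deduplication check against the KTable/Postgres table is left out:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class RekeyPipelineSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Step 1: Initial-Topic (keyed by ProgramUserId) -> re-key to MsgId + Action -> dedup-Topic
        builder.stream("Initial-Topic", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((programUserId, msg) -> msgIdPlusAction(msg))
               .to("dedup-Topic", Produced.with(Serdes.String(), Serdes.String()));

        // Step 2: dedup-Topic -> re-key back to ProgramUserId -> Final-Topic
        builder.stream("dedup-Topic", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((msgIdAction, msg) -> programUserId(msg))
               .to("Final-Topic", Produced.with(Serdes.String(), Serdes.String()));
    }

    // Placeholder helpers: how these fields are extracted depends on the
    // (unspecified) message format, so they simply return the raw value here.
    private static String msgIdPlusAction(String msg) { return msg; }
    private static String programUserId(String msg) { return msg; }
}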

Related

How to ensure that in one Kafka topic same key goes to same partition for multiple tables

I have a requirement to produce data from multiple MongoDB tables and push it to the same Kafka topic using the mongo-kafka connector. I also have to ensure that data with the same table key column value always goes to the same partition, to guarantee message ordering.
For example :
tables --> customer , address
table key columns --> CustomerID (for table customer), AddressID (for table address)
For CustomerID = 12345, it will always go to partition 1
For AddressID = 54321, it will always go to partition 2
For a single table, the second requirement is easy to achieve using chained transformations. However, for multiple tables going into one topic, I am finding it difficult, since each of these tables has a different key column name.
Is there any way to fulfil both requirements using the Kafka connector?
If you use the ExtractField$Key transform and the IntegerConverter, all matching IDs should go to the same partition.
If you have a key with two columns, or end up with keys like {"CustomerID": 12345}, then you have a composite/object key, meaning the whole key will be hashed when computing the partition, not the ID itself.
You cannot choose the partition based on specific fields within a record without setting producer.override.partitioner.class in the connector config. In other words, you need to implement a partitioner that deserializes your data, parses the values, then computes and returns the respective partition.
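For illustration, a rough sketch of such a custom partitioner (the class name and the way the ID is pulled out of the serialized key are assumptions; a real implementation would deserialize the key with the same converter the connector uses). It would then be wired in via producer.override.partitioner.class=com.example.TableAwarePartitioner in the connector configuration (the package name is made up):
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical partitioner: pulls the numeric ID out of the serialized key and
// hashes it, so the same CustomerID/AddressID always lands on the same partition,
// regardless of which table the record came from.
public class TableAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String keyString = keyBytes == null ? "" : new String(keyBytes);
        String id = keyString.replaceAll("[^0-9]", ""); // crude: keep only the digits of the ID
        return (id.hashCode() & 0x7fffffff) % numPartitions; // non-negative hash -> partition
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}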

Kafka Stream using same predicate to send the message to multiple branches [duplicate]

I have a single source CSV file containing records of different sizes, and every record is pushed into one source topic. I want to split the records into different KStreams/KTables from that source topic. I have a pipeline for one table load, where I push the record from the source topic into stream1 in delimited format and then into another stream in AVRO format, which is then consumed by a JDBC sink connector that pushes the record into a MySQL database. The pipeline needs to stay the same, but I want to push records of different tables into one source topic and then split them into different streams based on one value. Is this possible? I tried searching for ways to do that but could not find any. Can I improve the pipeline somehow, use a KTable instead of KStreams, or make any other modifications?
My current flow -
one source CSV file (source.csv) -> source topic (name - sourcetopic containing test1 records) -> stream 1 (delimited value format) -> stream 2 (as AVRO value format) -> end topic (name - sink-db-test1) -> JDBC sink connector -> MySQL DB (name - test1)
I have a different MySQL table, test2, with a different schema, and the records for this table are also present in the source.csv file. Since the schema is different, I cannot follow the current test1 pipeline to insert data into the test2 table.
Example -
in the CSV source file,
line 1 - 9,atm,mun,ronaldo
line 2 - 10,atm,mun,bravo,num2
line 3 - 11,atm,sign,bravo,sick
Here, the value on which the records are to be split is column 4 (ronaldo or bravo),
and the data should be loaded into table 1, table 2, and table 3 respectively.
The key is column 4:
if col4 == ronaldo, go to table 1
if col4 == bravo and col3 == mun, go to table 2
if col4 == bravo and col3 == sign, go to table 3
I am very new to Kafka; I started Kafka development only last week.
You can write a separate Kafka Streams application to split records from the input topic into different KStreams or output topics using the KStream#branch() operator (branch() is called on the KStream itself, e.g. one obtained via streamsBuilder.stream(...)):
KStream<K, V>[] branches = sourceKStream.branch(
    (key, value) -> {filter logic for topic 1 here},
    (key, value) -> {filter logic for topic 2 here},
    (key, value) -> true // catch-all: gets every message not matched above
);
// branches[0]: records matching predicate 1
// branches[1]: records matching predicate 2
// branches[2]: all remaining records
Or you could manually branch your KStream like this:
KStream<K, V> inputKStream = streamsBuilder.stream("your_input_topic", Consumed.with(keySerde, valueSerde));
inputKStream
    .filter((key, value) -> {filter logic for topic 1 here})
    .to("your_1st_output_topic");
inputKStream
    .filter((key, value) -> {filter logic for topic 2 here})
    .to("your_2nd_output_topic");
...
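For the concrete routing rules in the question (column 4, plus column 3 for the bravo rows), the filter approach might look roughly like the sketch below. It assumes the record value is the raw comma-delimited CSV line and uses String serdes; the sink-db-test2 and sink-db-test3 topic names are made up:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class CsvSplitSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each record value is assumed to be a raw CSV line, e.g. "10,atm,mun,bravo,num2"
        KStream<String, String> source =
                builder.stream("sourcetopic", Consumed.with(Serdes.String(), Serdes.String()));

        source.filter((key, line) -> col(line, 4).equals("ronaldo"))
              .to("sink-db-test1", Produced.with(Serdes.String(), Serdes.String()));

        source.filter((key, line) -> col(line, 4).equals("bravo") && col(line, 3).equals("mun"))
              .to("sink-db-test2", Produced.with(Serdes.String(), Serdes.String()));

        source.filter((key, line) -> col(line, 4).equals("bravo") && col(line, 3).equals("sign"))
              .to("sink-db-test3", Produced.with(Serdes.String(), Serdes.String()));
    }

    // 1-based column access on a comma-delimited line
    private static String col(String line, int n) {
        String[] parts = line.split(",");
        return n <= parts.length ? parts[n - 1].trim() : "";
    }
}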
I was able to split the data; I used KSQL for the approach I am sharing below.
1. An input stream is created with value_format='JSON' and a single payload column of type STRING.
2. The payload contains the whole record as a STRING.
3. The record is then split into different streams using the LIKE operator in the WHERE clause, putting the payload into different streams as per the requirement. Here I used KSQL's SPLIT function to extract the fields from the comma-delimited payload.

Kafka Streams as a table: Patch log, not full Post

Desired functionality: for a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: use AWS DynamoDB to hold all results for a given entry. Every time a result is ready, a micro-service does a PATCH operation on key123 in the database. An output stream checks each UPDATE to see if the entry is complete; if so, it is forwarded downstream.
New idea: use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic; the topic forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream: how do I have the output stream show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
  KEY INT KEY,   -- example key to group by
  EVENT STRING   -- example event to collect
) WITH (
  kafka_topic='source',  -- or whatever your source topic is called.
  value_format='json'    -- or whatever value format you need.
);
CREATE TABLE agg AS
  SELECT
    key,
    COLLECT_LIST(event) AS events
  FROM source
  GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key. For example, if events e1, e2 and e3 arrive for key 123, the AGG topic conceptually receives 123 -> [e1], then 123 -> [e1, e2], then 123 -> [e1, e2, e3].
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
  KEY INT KEY,
  EVENTS ARRAY<STRING>
) WITH (
  kafka_topic='AGG',
  value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your 'complete' criteria:
CREATE STREAM completed_results AS
  SELECT *
  FROM agg_stream
  WHERE IS_COMPLETE(EVENTS);
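For completeness, here is a rough sketch of what such a UDF might look like in ksqlDB; the class name and the 'exactly five non-null events' rule are made-up examples, since the answer above only names IS_COMPLETE:
import java.util.List;
import java.util.Objects;

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

// Hypothetical IS_COMPLETE UDF; the completeness rule below (exactly five
// non-null events collected) is only an example assumption.
@UdfDescription(name = "is_complete", description = "True once all expected events have arrived")
public class IsCompleteUdf {

    @Udf(description = "Checks whether the collected event list is complete")
    public boolean isComplete(final List<String> events) {
        return events != null
                && events.stream().filter(Objects::nonNull).count() == 5;
    }
}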

Apache Kafka Streams Interactive Queries - How to create a store where the value is an entity and not an aggregation

I have a topic which receives events with the following info:
key -> orderId (Integer)
value -> {"orderId" : aaa, "productId" : xxx, "userId" : yyy, "state" : "zzz"} (JSON with the whole info of the order)
I want to implement an interactive query to get the full order information by orderId. The idea is to be able to get the current state of an order from a materialized view (the Kafka Streams store).
First I create a KStream from the topic:
KStream<Integer, JsonNode> stream = kStreamBuilder.stream(integerSerde, jsonSerde, STREAMING_TOPIC);
Then I create a KTable and assign it to a store. The problem is that apparently I can only create stores where the value is an aggregation, for instance: stream.groupByKey().count("myStore");
The store I need should hold the whole order information, not an aggregation. Is this possible?
You can read the topic directly as a KTable, too:
KTable<Integer, JsonNode> table = kStreamBuilder.table(integerSerde, jsonSerde, STREAMING_TOPIC, "store-name-for-IQ");
This FAQ might also help: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
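Once the application is running, the store backing that KTable can be queried interactively. A minimal sketch, assuming streams is the started KafkaStreams instance built from kStreamBuilder above:
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Interactive query against the store created above
ReadOnlyKeyValueStore<Integer, JsonNode> store =
        streams.store("store-name-for-IQ", QueryableStoreTypes.keyValueStore());

// Latest full order JSON for orderId 123, or null if that order has not been seen yet
JsonNode order = store.get(123);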