Kafka Streams KTable/KTable join operation emitting message twice - apache-kafka

I am trying to join two KTables, and it seems the join operation emits the same message twice as output; the ValueJoiner appears to be invoked twice during this operation.
Let me know how this can be addressed so that only a single message is emitted as the output of the join operation.
KTable<ID, Message> joinedMsg = msg1.join(msg2, new MsgJoiner());
I receive two identical messages as a result of the join between the two KTables (msg1 and msg2).

This behaviour is usually noticed when caching is enabled.
If there are updates to the same key in both tables, each table is flushed independently, and therefore each table will trigger the join, so you get two results for the same key.
For example, say there are two tables, table1 and table2, which receive the following input data:
table1 A:1
table2 A:A
When the stores are flushed on the commit interval, the store for table1 is flushed first, which triggers the join and produces A:1:A. Then the store for table2 is flushed, which triggers the join again and produces another A:1:A.
You can try disabling the cache by setting cache.max.bytes.buffering=0.
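For example, a minimal sketch of setting this in a Kafka Streams application (the application id and bootstrap servers below are illustrative placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Disable the record cache so every update is forwarded downstream
// instead of being deduplicated on the commit interval.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-join-app");   // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);        // cache.max.bytes.buffering=0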
P.S. There is already an open issue about this behaviour of KTable/KTable joins.

Related

Can't we join two tables and fetch data in Kafka?

I have joined two tables and fetched data using the Postgres source connector, but every time it gave the same issue, i.e.
I have run the same query in Postgres and it runs without any issue. Is fetching data by joining tables not possible in Kafka?
I solved this issue by using a subquery. The problem was that
when I used an alias, the aliased column was interpreted as a whole as the column name, which caused the error. Here is my query:
select * from (select p."Id" as "p_Id", p."CreatedDate" as p_createddate, p."ClassId" as p_classid, c."Id" as c_id, c."CreatedDate" as c_createddate, c."ClassId" as c_classid from "PolicyIssuance" p left join "ClassDetails" c on p."DocumentNumber" = c."DocumentNo") as result

Join records coming at different times to Kafka streams

The requirement is this: records in a control table (database table), keyed by customer number and promotion code, are published to a topic, say T1. Records in an email table (database table), keyed by customer number, are published to a topic, say T2. Matching records for a customer number might be published to topics T1 and T2 at different times.
Solution attempted
Part 1 - Read T1 as a stream and T2 as a GlobalKTable, and join the two using the snippet below. This works fine when a new record is pushed to T1 and a lookup happens in T2 using the customer number as the foreign key:
T1.join(T2, T1.customerNumber, joiner).
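For reference, a rough sketch of what part 1 could look like in the Kafka Streams DSL, assuming string keys and values and an illustrative key format of "customerNumber|promotionCode" for T1 (all names here are assumptions, not the actual code):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// T1: control-table records, keyed by "customerNumber|promotionCode" (assumed format)
KStream<String, String> t1 = builder.stream("T1");
// T2: email-table records, keyed by customer number
GlobalKTable<String, String> t2 = builder.globalTable("T2");

// Part 1: for each T1 record, derive the customer number and look it up in T2
KStream<String, String> joined = t1.join(
        t2,
        (t1Key, t1Value) -> t1Key.split("\\|")[0],        // key selector: customer number
        (t1Value, t2Value) -> t1Value + " + " + t2Value); // joiner: combine both sides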
Part 2 - Read T2 as a stream and T1 as a GlobalKTable. Trying to join the records does not work here:
T2.join(T1, /* cannot build a key for T1 from a T2 record here */, joiner)
The solution breaks here.
Both of the above parts together form the complete solution. It would make sure that even if records are published at different times, a lookup in the GlobalKTable will find a matching record.
Is this the right solution for the requirement, or is there a better one? Currently part 2 of the solution is not working. Please advise.

Trying to understand MVCC

I'm trying to understand MVCC and can't get it. For example:
Transaction 1 (T1) tries to read a data row. At the same time, T2 updates the same row.
The flow of the transactions is
T1 begin -> T2 begin -> T2 commit -> T1 commit
So the first transaction gets its snapshot of the database and returns a result to the user, who will base further calculations on it. But as I understand it, the customer gets the old data value? As I understand MVCC, after T1 begins, that transaction doesn't know that some other transaction changed the data. So if the user now does some calculation on top of that (without the DB involved), they are doing it on stale data? Or am I wrong, and the first transaction has some mechanism to know that the row was changed?
Let's now change the transaction flow.
T2 begin -> T1 begin -> T2 commit -> T1 commit
With 2PL the user gets the newer version of the data because of locks (T1 must wait until the exclusive lock is released). But with MVCC it will still be the old data, as I understand the PostgreSQL MVCC model. So I can get stale data in exchange for speed. Am I right? Or am I missing something?
Thank you
Yes, it can happen that you read some old data from the database (that a concurrent transaction has modified), perform a calculation based on this and store “outdated” data in the database.
This is not a problem, because rather than the actual order in which transactions happen, the logical order is more relevant:
If T1 reads some data, then T2 modifies the data, then T1 modifies the database based on what it read, you can say that T1 logically took place before T2, and there is no inconsistency.
You only get an anomaly if T1 and T2 modify the same data: T1 reads data, T2 modifies the data, then T1 modifies the same data based on what it read. This anomaly is called a “lost update”.
Lost updates can only occur with the weakest (which is the default) isolation level, READ COMMITTED.
If you want better isolation from concurrent activity, you need to use at least REPEATABLE READ isolation. Then T1 would receive a serialization error when it tries to update the data, and you would have to repeat the transaction. On the second attempt, it would read the new data, and everything would be consistent.
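A minimal JDBC sketch of that retry pattern for PostgreSQL (the connection details, table and update are illustrative; 40001 is PostgreSQL's SQLSTATE for serialization_failure):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Run the read-compute-write cycle under REPEATABLE READ and retry on serialization failure.
try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")) {
    conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
    conn.setAutoCommit(false);
    boolean done = false;
    while (!done) {
        try (Statement st = conn.createStatement()) {
            // read, compute, then write back (illustrative statement)
            st.executeUpdate("UPDATE account SET balance = balance - 100 WHERE id = 1");
            conn.commit();
            done = true;
        } catch (SQLException e) {
            conn.rollback();
            if (!"40001".equals(e.getSQLState())) {
                throw e; // not a serialization failure, give up
            }
            // serialization failure: loop and retry with a fresh snapshot
        }
    }
}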
The flow of the transactions is T1 begin -> T2 begin -> T2 commit -> T1 commit. So the first transaction gets its snapshot of the database and returns a result to the user, who will base further calculations on it. But as I understand it, the customer gets the old data value?
Unless T1 and T2 try to update the same row, this could usually be reordered to be the same as: T1 begin; T1 commit; T2 begin; T2 commit;
In other words, any undesirable business outcome that could be achieved by T2 changing the data while T1 was making a decision, could have also occurred by T2 changing the data immediately after T1 made the same decision.
Each transaction can only see data that is younger than or equal to that transaction's ID.
When transaction 1 reads data, it sets the read timestamp of that data to transaction 1's ID.
If transaction 2 tries to read the same data, it checks the read timestamp of the data; if that read timestamp is less than transaction 2's ID, then transaction 2 is aborted, because 1 < 2: transaction 1 got there before us and must finish before us.
At commit time, we also check whether the read timestamp of the data is less than the committing transaction's ID. If it is, we abort the transaction and restart with a new transaction ID.
We also check whether the write timestamps are less than our transaction's ID; the younger transaction wins.
There is an edge case where a younger transaction can abort if an older transaction gets ahead of it.
I have actually implemented MVCC in Java. (see transaction, runner and mvcc code)
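For illustration only, here is a self-contained Java sketch of per-record read/write timestamp checks in the spirit of basic timestamp ordering (this is not the implementation linked above; all names and abort rules are illustrative):

// Each record tracks the highest transaction ID that has read it and the ID of its last writer.
class Record {
    long readTs;   // highest transaction ID that has read this record
    long writeTs;  // transaction ID that last wrote this record
    Object value;
}

class AbortException extends RuntimeException {}

class TimestampChecks {
    // A transaction may read only if no younger (higher-ID) transaction has written the record.
    static Object read(Record r, long txnId) {
        if (r.writeTs > txnId) {
            throw new AbortException(); // a younger writer got there first: abort and restart
        }
        r.readTs = Math.max(r.readTs, txnId);
        return r.value;
    }

    // A transaction may write only if no younger transaction has read or written the record.
    static void write(Record r, long txnId, Object newValue) {
        if (r.readTs > txnId || r.writeTs > txnId) {
            throw new AbortException(); // a younger transaction got there first: abort and restart
        }
        r.writeTs = txnId;
        r.value = newValue;
    }
}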

KSQL - Determining When a Table Is Loaded

How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?
GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.
EXAMPLE:
I am using KSQL's REST API to issue the following commands.
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (Kafka_topic='topicC', PARTITIONS = 1, value_format='json') AS SELECT a.A1 as A1, a.A2 as A2, b.B1 as B1, b.B2 as B2 FROM MyStream b left join MyTable a on a.A1 = b.B1;
PROBLEM: topicC only has data from topicB, and all joined values are null.
Although I receive back a status of SUCCESS from the CREATE TABLE command, it appears that the data has not fully loaded into the table. Consequently, the result of the third command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, the resulting topic will correctly have data from both topics. How can I determine when my table is loaded and it is safe to execute the join command?
This is a great question. At this point KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded, although that would indeed be a useful feature. A more general, related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751
Tables in KSQL (and the underlying Kafka Streams) have a time dimension, i.e., they evolve over time. For a stream-table join, each stream record is joined with the "correct" table version (i.e., tables are versioned by time).
In the upcoming CP 5.1 release, you can "pre-load" the table by ensuring that all record timestamps of the table topic are smaller than the record timestamps of the stream topic. This tells KSQL that it needs to process the table topic data first and advance the table's timestamp-version accordingly before it can start joining.
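A hedged Java producer sketch of this "pre-loading", writing the table-topic records with explicitly older timestamps than the stream-topic records (topic names, keys, payloads and timestamp values are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // illustrative
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    long tableTs = 1_000L;   // table-record timestamps: strictly smaller ...
    long streamTs = 2_000L;  // ... than stream-record timestamps

    // Table topic (topicA) gets the older timestamp, so it is processed first
    producer.send(new ProducerRecord<>("topicA", null, tableTs, "A1-key", "{\"A1\":\"A1-key\",\"A2\":\"x\"}"));
    // Stream topic (topicB) gets the newer timestamp
    producer.send(new ProducerRecord<>("topicB", null, streamTs, "B1-key", "{\"B1\":\"A1-key\",\"B2\":\"y\"}"));
}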
For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin

KSQL: Left join displays columns from the stream but not the table

I have one stream and one table in KSQL, as shown below:
Stream name: DEAL_STREAM
Table name: EXPENSE_TABLE
When I run the query below, it displays only columns from the stream; no table columns are displayed.
Is this the expected output? If not, am I doing something wrong?
SELECT TD.EXPENSE_CODE, TD.BRANCH_CODE, TE.EXPENSE_DESC
FROM DEAL_STREAM TD
LEFT JOIN EXPENSE_TABLE TE ON TD.EXPENSE_CODE = TE.EXPENSE_CODE
WHERE TD.EXPENSE_CODE LIKE '%NL%' AND TD.BRANCH_CODE LIKE '%AM%';
The output of the query is shown below.
NL8232##0 | AM | null
NL0232##0 | AM | null
NL6232#!0 | AM | null
NL5232^%0 | AM | null
When I run the query below, it displays only columns from the stream; no table columns are displayed.
In a stream-table (left) join, the output records will contain null values for the table-side columns if there is no matching record in the table at the time of the join lookup.
Is this the expected output? If not, am I doing something wrong?
Is it possible that, for example, you wrote (1) the input data into the stream before you wrote (2) the input data into the table? If so, the stream-table join query would have attempted to perform table lookups at time (1), when no such lookup data was available in the table yet (because it only arrived later, at time (2)). Because there was no table data available, the join wrote output records in which the table-side columns were null.
Note: This stream-table join behaviour in KSQL (and, by extension, Apache Kafka's Streams API, on which KSQL is built) is pretty much the norm for joins in the streaming world. Only the stream side of a stream-table join triggers downstream join output, and if there is no match for a stream record on the table side at the time the input record is being joined, the table-side columns will be null. Since this is a common cause of user confusion, however, we are currently working on adding table-side triggering of join output to Apache Kafka's Streams API and KSQL. When such a feature is available, your issue above would not happen anymore.