Ksql: Left Join Displays columns from stream but not tables - apache-kafka

I have one steam and a table in KSQL as mentioned below:
Stream name: DEAL_STREAM
Table name: EXPENSE_TABLE
When I run the below queries it displays only columns from the stream but no table columns are being displays.
Is this the expected output. If not am I doing something wrong?
SELECT TD.EXPENSE_CODE, TD.BRANCH_CODE, TE.EXPENSE_DESC
FROM DEAL_STREAM TD
LEFT JOIN EXPENSE_TABLE TE ON TD.EXPENSE_CODE = TE.EXPENSE_CODE
WHERE TD.EXPENSE_CODE LIKE '%NL%' AND TD.BRANCH_CODE LIKE '%AM%';
An output of the query is as shown below.
NL8232##0 | AM | null
NL0232##0 | AM | null
NL6232#!0 | AM | null
NL5232^%0 | AM | null

When I run the below queries it displays only columns from the stream but no table columns are being displays.
In a stream-table (left) join, the output records will contain null columns (for table-side columns) if there is not matching record in the table at the time of the join/lookup.
Is this the expected output. If not am I doing something wrong?
Is it possible that, for example, you wrote the (1) input data into the stream before you wrote (2) the input data into the table? If so, then the stream-table join query would have attempted to perform table-lookups at the time of (1) when no such lookup data was available in the table yet (because that happened later at time (2)). Because there was no such table data available, the join wrote output records where the table-side columns were null.
Note: This stream-table join in KSQL (and, by extension, Apache Kafka's Streams API, on which KSQL is built) is the pretty much the norm for joins in the streaming world. Here, only the stream-side of the stream-table join will trigger downstream join outputs, and if there's no matching for a stream record on the table-side at the time when a new input record is being joined, then the table-side columns will be null. Since this is, however, a common cause of user confusion, we are currently working on adding table-side triggering of join output to Apache Kafka's Streams API and KSQL. When such a feature is available, then your issue above would not happen anymore.

Related

Unexpected sort order on postgres left outer join

Background
I'm using Postgres 11 and pgAdmin4 v5.2. The problem I describe below is on my dev machine which has both the postgres server and pgAdmin client.
Questions I've looked at on SO that deal with incorrect ordering seem related to collation-related issues with ordering of text fields, whereas my problem is on an integer field.
Setup
I have a table norm_plans that contains ~5k records.
Column | Type
---------------------------------
canon_id | integer
name | character varying(200)
...
other fields
canon_id is autopopulated using a sequence.
I've created a new table norm_plans_cmp as a copy of norm_plans (CREATE TABLE norm_plans_cmp AS TABLE norm_plans WITH DATA;)
I next insert some new records into norm_plans and update some existing records (fields other than canon_id.
The new records increment the sequence and are assigned canon_id values as expected.
I now want to compare norm_plans against norm_plans_cmp so I perform a left outer join:
select a.*, b.*
from norm_plans a
left outer join norm_plans_cmp b
on a.canon_id = b.canon_id
order by a.canon_id
Problem
I would expect records to be sorted by canon_id. This holds true from 1-2000, but after 2,000 I get canon_ids from 5,001 to 5,111 (which is the last canon_id) and then it again picks up from 2,001. I'm viewing this data from pgAdmin, see screenshot 1 below showing the shift from 2,000 to 5,001, and screenshot 2 showing the transition again from 5,111 back to 2,001.
Additional observations
While incorrect, the ordering seems consistent. Running the query multiple times results in the same (incorrect) ordering.
Despite my question title, I'm not totally sure the left join has anything to do with this.
Running SELECT * ... ORDER BY canon_id on norm_plans or norm_plans_cmp alone also result in incorrect ordering, albeit at different points in the order.
Answers to this SO question suggest index corruption may be a contributing problem, but I have no indexes on either norm_plans or norm_plans_cmp (canon_id is not defined as a PK).
At this point, I'm stumped!

Can't we join two tables and fetch data in Kafka?

I have joined two tables and fetched data using Postgres source connector. But every time it gave the same issue i.e.
I have run the same query in Postgres and it runs without any issue. Is fetching data by joining tables not possible in Kafka?
I solve this issue by using the concept of the subquery. The problem was that
when I use an alias, the alias column is interpreted as a whole as a column name and therefore the problem occurs. Here goes my query:
select * from (select p.\"Id\" as \"p_Id\", p.\"CreatedDate\" as p_createddate, p.\"ClassId\" as p_classid, c.\"Id\" as c_id, c.\"CreatedDate\" as c_createddate, c.\"ClassId\" as c_classid from \"PolicyIssuance\" p left join \"ClassDetails\" c on p.\"DocumentNumber\" = c.\"DocumentNo\") as result"

Can I convert from Table to Stream in KSQL?

I am working in the kafka with KSQL. I would like to find out the last row within 5 min in different DEV_NAME(ROWKEY). Therefore, I have created the stream and aggregated table for further joining.
By below KSQL, I have created the table for finding out the last row within 5 min for different DEV_NAME
CREATE TABLE TESTING_TABLE AS
SELECT ROWKEY AS DEV_NAME, max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY ROWKEY;
Then, I would like to join together:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
However, it occured the error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (org.apache.kafka.streams.kstream.TimeWindowedSerializer) is not compatible to the actual key type (key type: org.apache.kafka.connect.data.Struct). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It should be the WINDOW TUMBLING function changed my ROWKEY style
(e.g. DEV_NAME_11508 -> DEV_NAME_11508 : Window{start=157888092000 end=-}
Therefore, without setting the Serdes, could I convert from the table to stream and set the PARTITION BY DEV_NAME?
As you've identified, the issue is that your table is a windowed table, meaning the key of the table is windowed, and you can not look up into a windowed table with a non-windowed key.
You're table, as it stands, will generate one unique row per-ROWKEY for each 5 minute window. Yet it seems like you don't care about anything but the most recent window. It may be that you don't need the windowing in the table, e.g.
CREATE TABLE TESTING_TABLE AS
SELECT
ROWKEY AS DEV_NAME,
max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM
WHERE ROWTIME > (UNIX_TIMESTAMP() - 300000)
GROUP BY ROWKEY;
Will track the max timestamp per key, ignoring any timestamp that is over 5 minutes old. (Of course, this check is only done at the time the event is received, the row isn't removed after 5 minutes).
Also, this join:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
Almost certainly isn't doing what you think and wouldn't work in the way you want due to race conditions.
It's not clear what you're trying to achieve. Adding more information about your source data and required output may help people to provide you with a solution.

KSQL - Determining When a Table Is Loaded

How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?
GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.
EXAMPLE:
I am using Ksql's Rest API to issue the following commands.
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (Kafka_topic='topicC', PARTITIONS = 1, value_format='json') AS SELECT a.A1 as A1, a.A2 as A2, b.B1 as B1, b.B2 as B2 FROM MyStream b left join MyTable a on a.A1 = b.B1;
PROBLEM: topicC only has data from topicB, and all joined values are null.
Although I receive back a status of SUCCESS from the create table command, it appears that the data has not fully loaded into the table. Consequently the result of the 3rd command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, then the resulting topic will correctly have data from both topics. How can I determine when my table is loaded, and it is safe to execute the join command?
This is indeed a great question. At this point KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded. This is indeed a useful feature. A more general and related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751
Tables in KSQL (and underlying Kafka Streams) have a time dimension, ie, the evolve over time. For a stream-table join, each stream-record is joined with the "correct" table version (ie, tables are versioned by time).
In upcoming CP 5.1 release, you can "pre-load" the table, by ensuring that all record timestamp of the table topic are smaller than the record timestamps of the stream topic. This tells, KSQL, that it needs to process the table topic data first, but advance the table timestamp-version accordingly before it can start joining.
For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin

Two Kafka Stream Ktable Join operation emitting message twice

I am trying to join two Ktable streams and it seems that as an output of JOIN operation I am getting the same message as an output twice . Seems value Joiner is invoked twice during this operation .
Let me know how this can be addressed so that only a single message is emitted as an output of Join operation.
KTable<ID, Message> joinedMsg = msg1.join(msg2, new MsgJoiner());
I receive two identical messages as a result of JOIN between two KTables (msg1 and msg2) .
This behaviour is noticed usually when caching is enabled.
If there are updates to the same key in both tables, each table is flushed independently, and therefore each table will trigger the join, so you get two results for the same key.
i.e. There are two tables : table1 and table2. Following is the input data received in table1 and table2:
table1 A:1
table2 A:A
When the stores are flushed on the commit interval. it flushes the store for table1, triggers the join and produces A:1:A. Then it will flush table2, triggers the join and produce A:1:A
You can try disabling cache by setting cache.max.bytes.buffering=0.
P.S. There is already an open issue in KTable/KTable joins.