Join records coming at different times to Kafka streams - apache-kafka

The requirement is as follows. Records from the control table (a database table), keyed by customer number and promotion code, are published to a topic, say T1. Records from the email table (a database table), keyed by customer number, are published to a topic, say T2. Matching records for a customer number may be published to T1 and T2 at different times.
Solution attempted
Part 1 - Read T1 as a stream and T2 as a GlobalKTable, and join the two using the snippet below. This works fine: when a new record is pushed to T1, a lookup happens in T2 using the customer number as the foreign key.
T1.join(T2, T1.customerNumber, joiner)
Part 2 - Read T2 as a stream and T1 as a GlobalKTable. Joining the records does not work here, because the key for T1 (customer number and promotion code) cannot be built from a T2 record.
T2.join(T1, /* cannot build key for T1 from T2 record here */, joiner)
Both the above parts form the complete solution. The solution would make sure that even if records are published at different times a lookup in GlobalKTable will get a matching record.
Is this the right solution for the requirement, or is there a better one? Currently part 2 of the solution is not working. Please advise.
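The asymmetry between the two parts can be illustrated without Kafka. The sketch below is plain Python, not the Kafka Streams API: dictionaries stand in for the GlobalKTables, and the field names and values (`discount`, `email`, `promo`) are hypothetical. It assumes, as described above, that T1 is keyed by (customer number, promotion code) and T2 by customer number alone.

```python
# Plain-Python stand-ins for the two lookup tables; all values are illustrative.
# T2 as a table: keyed by customer number only.
t2_table = {"C1": {"email": "c1@example.com"}}
# T1 as a table: keyed by the composite (customer number, promotion code).
t1_table = {("C1", "P1"): {"discount": 10}, ("C1", "P2"): {"discount": 20}}


def join_t1_record(key, value):
    """Part 1: a T1 record carries the customer number, so T2's key can be derived."""
    customer, _promo = key
    match = t2_table.get(customer)  # exact single-key lookup
    return None if match is None else {**value, **match}


def join_t2_record(customer, value):
    """Part 2: a T2 record has no promotion code, so T1's full key cannot be built.
    On this key layout the only option is to scan every T1 key."""
    return [
        {**value, **t1_value, "promo": promo}
        for (t1_customer, promo), t1_value in t1_table.items()
        if t1_customer == customer
    ]


part1 = join_t1_record(("C1", "P1"), {"discount": 10})
part2 = join_t2_record("C1", {"email": "c1@example.com"})
print(part1)
print(part2)
```

In this model, part 1 is a single-key lookup, whereas part 2 can match several T1 records per customer, which a key mapper that must return exactly one T1 key cannot express.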

Related

Generating a materialized join table with a many-to-one column in ksqldb

thanks in advance for any input!
I have the requirement of retrieving data from the 4 database tables below via 1 HTTP request. I've chosen to create a materialized table with KSQLDB that will contain all relevant data from the 4 tables. My API Gateway will then query that table using KSQLDB's REST API.
My struggle is in creating 1 KSQLDB table to show information for a purchase order webpage, which consists of data from all 4 of the services below:
vendor_tbl
contract_tbl (has vendorId column referencing vendor_tbl's PK)
services_tbl (has contractId column referencing contract_tbl's PK)
purchase_order_tbl (has both vendorId && contractId columns referencing the vendor/contract tables' PKs)
The issue lies mainly with the services table, because it has a many-to-one relationship with contracts. 1 service is for 1 contract, but 1 contract can have many services, which is common.
The structure as is works perfectly in the RDBMS context, but I'm struggling to find any way to use KSQLDB to create 1 materialized table containing all of the service IDs for the PO, along with data from the corresponding contract, vendor && PO tables...
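For reference, the relational result being targeted can be sketched with SQLite from Python. Only the table names and the `vendorId` and `contractId` columns come from the description above; the `id` and `name` columns and all sample rows are assumptions made for illustration. The query yields one row per service, which is why the service ID is the natural key of the combined table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendor_tbl (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE contract_tbl (id TEXT PRIMARY KEY, vendorId TEXT);
CREATE TABLE services_tbl (id TEXT PRIMARY KEY, contractId TEXT);
CREATE TABLE purchase_order_tbl (id TEXT PRIMARY KEY, vendorId TEXT, contractId TEXT);
-- Illustrative data: one vendor, one contract, two services, one PO.
INSERT INTO vendor_tbl VALUES ('v1', 'Acme');
INSERT INTO contract_tbl VALUES ('c1', 'v1');
INSERT INTO services_tbl VALUES ('s1', 'c1'), ('s2', 'c1');
INSERT INTO purchase_order_tbl VALUES ('po1', 'v1', 'c1');
""")

# One output row per service, carrying the matching PO, contract and vendor data.
rows = conn.execute("""
SELECT s.id, po.id, c.id, v.name
FROM services_tbl s
JOIN contract_tbl c ON s.contractId = c.id
JOIN vendor_tbl v ON c.vendorId = v.id
JOIN purchase_order_tbl po ON po.contractId = c.id AND po.vendorId = v.id
ORDER BY s.id
""").fetchall()

for row in rows:
    print(row)
```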
What I have tried:
1.
I have tried creating 2 streams, each joining 2 of the 4 sources, and then "daisy-chaining" the 2 joined streams as such:
CREATE STREAM po_vendor_join AS SELECT * FROM po_stream p INNER JOIN vendors_tbl v ON p.vendorId = v.id;
CREATE STREAM service_contract_join AS SELECT * FROM services_stream s INNER JOIN contracts_tbl c ON s.contractId = c.id;
This works, but of course there are duplicate entries in the service_contract_join stream. The next step is to create a table from these 2 streams, but I cannot do this because of the aggregation/GROUP BY requirements, which require that every column in the table be part of the aggregation. I understand that I would have duplicates in OTHER columns (the vendor PK and contract PK are referenced in multiple tables, for instance), but the PK itself is UNIQUE and needs to be the id of services_tbl in this case, since there are multiple services for a PO. (Note that I have also tried LEFT and RIGHT OUTER joins on the streams.)
2.
I have tried "staggering" between table/stream, as such:
CREATE TABLE vendors_tbl (id VARCHAR PRIMARY KEY, name VARCHAR) WITH (KAFKA_TOPIC='vendors', VALUE_FORMAT='json', PARTITIONS=1);
CREATE STREAM contracts_stream (id VARCHAR, vendorId VARCHAR) WITH (KAFKA_TOPIC='contracts', VALUE_FORMAT='json', PARTITIONS=1);
And then rendering a joined stream between the table/stream like so:
CREATE STREAM po_vendor_join
AS SELECT p.*, v.*
FROM po_stream p
INNER JOIN vendors_tbl v ON p.vendorId = v.id;
But then, when trying to join the streams, I am of course met with the same restrictions as in attempt #1.
3.
I have tried making all 4 of the services into tables, simply modeling them as non-queryable tables.
The problem with this approach arises when I try to create a join table between 2 of the tables, like below:
CREATE TABLE PO_HOLLISTIC_AGGREGATE
AS
SELECT * FROM services_contracts_join_tbl scj
INNER JOIN po_vendors_join_tbl pvj ON scj.c_vendorId = pvj.v_id;
I receive an error here stating that the join needs to be on the PK of the right table, but again, the PK required here would be that of the services table, because there are multiple services.
This makes me conclude that the only way this would be possible with KSQLDB is if I stored the service PK on the other 3 streams, which wouldn't be doable either really because of the aggregation restrictions when creating a table via joined streams.
I'd appreciate any ideas, thanks again!

Tableau joining multiple tables

Hi, I have trouble understanding the structure of this table connection.
Let's suppose that all joins are inner join. Does this picture mean:
Orders JOIN (Orders1 JOIN People) JOIN Returns?
or
Orders JOIN Orders1 JOIN (People JOIN Returns)?
I don't understand why Orders1 and People are both vertically aligned and both connected to the Orders table.
(As I understand it, a join operation is bilateral, not trilateral. I imagined that the joins would all be laid out horizontally, like a chain.)
I know SQL, so it would be easier for me to follow if you wrote a pseudo-SQL script.
The example you have chosen is not a very good one for understanding joins/relationships in Tableau.
A wire/line between two tables indicates a join between the two tables on some id column (one-to-many, one-to-one, or many-to-many). You can check which field (read: column) the relationship is based on by clicking that relationship thread (line). I call this example not a good one because joining the ORDERS table to itself has umpteen options; you can edit that relationship in many ways.
So, in your example, ORDERS is joined with itself (ORDERS1); the result will of course depend on the relationship type. Simultaneously, ORDERS is joined with the PEOPLE table. Since these tables have only one field in common, this relationship has resulted in just one extra column in the ORDERS result. Now, PEOPLE is also connected with RETURNS, where no column is common, so I am not able to understand this relationship.
Watching this 5-minute video is recommended.
Translated, this relationship would be something like:
(ORDERS JOIN (PEOPLE JOIN RETURNS)) JOIN ORDERS1
(The last JOIN, outside the parentheses, is on the result of the parenthesized expression, but with a field/fields from ORDERS.)

Join returns duplicate rows

I am learning PostgreSQL using the Sakila database. There's a table called actor that has an actor_id, first_name and last_name. There is another table that has actors mapped to films through an actor_id and film_id combination.
I expect the following query to return one row for each actor with the maximum value for film_id for that actor, but I am getting multiple rows instead of one (the maximum of film_id for that actor).
SELECT actor.first_name, actor.last_name, MAX(film_actor.film_id)
FROM actor
LEFT JOIN film_actor ON film_actor.actor_id = actor.actor_id
GROUP BY film_actor.film_id, actor.first_name, actor.last_name
ORDER BY film_actor.film_id;
I appreciate your help in understanding how to get this right using joins (I already have the solution to achieve this using a sub-query).
PS: I am sure this question is often asked by beginners to SQL, but I have not seen an answer that works yet.
You must remove film_actor.film_id from:
GROUP BY film_actor.film_id, actor.first_name, actor.last_name
because you want to group by actor only.
So change to this:
GROUP BY actor.actor_id, actor.first_name, actor.last_name
I also added actor.actor_id in case there are 2 actors with the same name.
Also change the ORDER BY clause to:
ORDER BY MAX(film_actor.film_id)
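The effect of the change can be checked on a tiny stand-in for the Sakila tables. This is a sketch using SQLite from Python rather than PostgreSQL, and the sample rows are made up; the two queries are the one from the question and the corrected one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
-- Illustrative rows: Ann has two films, Bob one, Cy none.
INSERT INTO actor VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Doe');
INSERT INTO film_actor VALUES (1, 10), (1, 20), (2, 5);
""")

# Query from the question: grouping by film_id yields one row per (film, actor).
original = conn.execute("""
SELECT actor.first_name, actor.last_name, MAX(film_actor.film_id)
FROM actor
LEFT JOIN film_actor ON film_actor.actor_id = actor.actor_id
GROUP BY film_actor.film_id, actor.first_name, actor.last_name
ORDER BY film_actor.film_id
""").fetchall()

# Corrected query: grouping by actor only yields one row per actor.
corrected = conn.execute("""
SELECT actor.first_name, actor.last_name, MAX(film_actor.film_id)
FROM actor
LEFT JOIN film_actor ON film_actor.actor_id = actor.actor_id
GROUP BY actor.actor_id, actor.first_name, actor.last_name
ORDER BY MAX(film_actor.film_id)
""").fetchall()

print(len(original), "rows from the original query")
print(len(corrected), "rows from the corrected query")
```

Because of the LEFT JOIN, an actor with no films still appears in the corrected result, with NULL as the maximum film_id. Where that NULL row sorts depends on the database, so the ordering seen here in SQLite may differ from PostgreSQL.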

KSQL - Determining When a Table Is Loaded

How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?
GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.
EXAMPLE:
I am using Ksql's Rest API to issue the following commands.
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (Kafka_topic='topicC', PARTITIONS = 1, value_format='json') AS SELECT a.A1 as A1, a.A2 as A2, b.B1 as B1, b.B2 as B2 FROM MyStream b left join MyTable a on a.A1 = b.B1;
PROBLEM: topicC only has data from topicB, and all joined values are null.
Although I receive back a status of SUCCESS from the create table command, it appears that the data has not fully loaded into the table. Consequently the result of the 3rd command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, then the resulting topic will correctly have data from both topics. How can I determine when my table is loaded, and it is safe to execute the join command?
This is indeed a great question. At this point, KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded. That would indeed be a useful feature. A more general and related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751
Tables in KSQL (and the underlying Kafka Streams) have a time dimension, i.e., they evolve over time. For a stream-table join, each stream record is joined with the "correct" table version (i.e., tables are versioned by time).
In the upcoming CP 5.1 release, you can "pre-load" the table by ensuring that all record timestamps of the table topic are smaller than the record timestamps of the stream topic. This tells KSQL that it needs to process the table topic data first, and advance the table timestamp-version accordingly, before it can start joining.
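The timestamp rule can be illustrated with a small model. This is a simplified sketch in plain Python, not KSQL or Kafka Streams code: it assumes a single partition and strict timestamp-ordered processing, and it breaks timestamp ties in favour of the table, which is an assumption of this sketch. The function name `left_join_by_time` and the sample records are hypothetical.

```python
def left_join_by_time(table_records, stream_records):
    """Process both inputs in timestamp order; each stream record is joined
    against the table as it stands at that record's timestamp.
    Records are (timestamp, key, value) tuples."""
    # The 0/1 field makes table updates sort before stream records on ties.
    events = [(ts, 0, "table", k, v) for ts, k, v in table_records]
    events += [(ts, 1, "stream", k, v) for ts, k, v in stream_records]
    table, output = {}, []
    for _ts, _order, kind, key, value in sorted(events):
        if kind == "table":
            table[key] = value  # advance the table to a newer version
        else:
            # Left join: the table side is None when no version exists yet.
            output.append((key, value, table.get(key)))
    return output


# Table record is older than the stream record: the join finds it.
preloaded = left_join_by_time([(1, "k1", "a2")], [(5, "k1", "b2")])
# Table record is newer than the stream record: the table side is null.
late = left_join_by_time([(9, "k1", "a2")], [(5, "k1", "b2")])
print(preloaded)
print(late)
```

The second case mirrors the symptom in the question, where the output contains the stream fields but null table fields.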
For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin

Two Kafka Stream Ktable Join operation emitting message twice

I am trying to join two KTables, and it seems that the JOIN operation emits the same message twice. It seems the ValueJoiner is invoked twice during this operation.
Please let me know how this can be addressed so that only a single message is emitted as the output of the JOIN operation.
KTable<ID, Message> joinedMsg = msg1.join(msg2, new MsgJoiner());
I receive two identical messages as a result of the JOIN between the two KTables (msg1 and msg2).
This behaviour is usually noticed when caching is enabled.
If there are updates to the same key in both tables, each table is flushed independently, and therefore each table will trigger the join, so you get two results for the same key.
For example, suppose there are two tables, table1 and table2, and the following input data is received in them:
table1 A:1
table2 A:A
When the stores are flushed on the commit interval, the store for table1 is flushed first, which triggers the join and produces A:1:A. Then the store for table2 is flushed, which triggers the join and produces A:1:A again.
You can try disabling cache by setting cache.max.bytes.buffering=0.
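The flush behaviour described above can be modelled in a few lines. This is a simplified sketch in plain Python, not Kafka Streams code: dictionaries stand in for the state stores, the function names are hypothetical, and the model assumes an inner join in which both updates are already in the stores before either table is flushed.

```python
def join_with_cache(updates_t1, updates_t2):
    """Caching on: updates are buffered, then each table is flushed independently.
    Updates are (key, value) pairs."""
    store1, store2 = dict(updates_t1), dict(updates_t2)
    results = []
    # Flush table1: every dirty key triggers the join against table2's store.
    for key, v1 in updates_t1:
        if key in store2:
            results.append(f"{key}:{v1}:{store2[key]}")
    # Flush table2: every dirty key triggers the join against table1's store.
    for key, v2 in updates_t2:
        if key in store1:
            results.append(f"{key}:{store1[key]}:{v2}")
    return results


def join_without_cache(updates):
    """Caching off: each (table, key, value) update is forwarded immediately."""
    store1, store2, results = {}, {}, []
    for table, key, value in updates:
        if table == "table1":
            store1[key] = value
        else:
            store2[key] = value
        # Inner join: emit only when both sides have a value for the key.
        if key in store1 and key in store2:
            results.append(f"{key}:{store1[key]}:{store2[key]}")
    return results


cached = join_with_cache([("A", "1")], [("A", "A")])
uncached = join_without_cache([("table1", "A", "1"), ("table2", "A", "A")])
print(cached)
print(uncached)
```

In this model the cached run emits A:1:A twice, once per flushed table, while the uncached run emits it once, because the first update finds no match on the other side.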
P.S. There is already an open issue for KTable-KTable joins.