Generating a materialized join table with a many-to-one column in ksqldb - apache-kafka

thanks in advance for any input!
I have the requirement of retrieving data from 4 of the below databases via 1 HTTP request. I've chosen to create a materialized table with KSQLDB which will contain all relevant data from the 4 database tables. My API Gateway will then query that table using KSQLDB's rest API.
My struggle is in creating 1 KSQLDB table to show information for a purchase order webpage which consists of data from all 4 of the below services shown below:
vendor_tbl
contract_tbl (has vendorId column referencing vendor_table's PK)
services_tbl (has contractId column referencing contract_table's PK)
purchase_order_tbl (has both vendorId && contractId columns referencing vendor/contract table's PKs)
The issue lies mainly with the services table, because it has a many-to-one relationship with contracts. 1 service is for 1 contract, but 1 contract can have many services, which is common.
The structure as is works perfectly in the RDBMS context, but I'm struggling to find any lane using KSQLDB to create 1 materialized table containing all of the services ID's for the PO, along with data from corresponding, contract, vendor && PO tables...
What I have tried:
1.
I have tried creating 2 streams, 1 to join 2/4 of the streams each and then "daisy-chaining" the 2 joined streams as such:
CREATE STREAM po_vendor_join AS SELECT * FROM po_stream p INNER JOIN vendors_tbl v ON p.vendorId = v.id;
CREATE STREAM service_contract_join AS SELECT * FROM services_stream s INNER JOIN contracts_tbl c ON s.contractId = c.id;
This works. But of course there are duplicate entries in the service_contract_join stream, but the next step anyway is to create a table from these 2 streams, but I cannot do this because of the aggregation/group by requirements, which require that every column in the table be part of the aggregation. I understand that I would have duplicates in OTHER columns (vendor PK, contract PK is referenced in multiple tables for instance) but the PK itself is UNIQUE, and needs to be the id of the services_tbl in this case, since there are multiple for a PO. (Note that I have also tried OUTER LEFT && RIGHT joins on the streams)
2.
I have tried "staggering" between table/stream, as such:
CREATE TABLE vendors_tbl (id VARCHAR PRIMARY KEY, name VARCHAR) WITH (KAFKA_TOPIC='vendors', VALUE_FORMAT='json', PARTITIONS=1);
CREATE STREAM contracts_stream (id VARCHAR, vendorId VARCHAR) WITH (KAFKA_TOPIC='contracts', VALUE_FORMAT='json', PARTITIONS=1);
And then rendering a joined stream between the table/stream like so:
CREATE STREAM po_vendor_join
AS SELECT p.*, v.*
FROM po_stream p
INNER JOIN vendors_tbl v ON p.vendorId = v.id;
But then when trying to join the stream, of course I am met with the same restrictions as in attempt # 1.
3.
I have tried making all 4 of the services tables, and simply modeling them as non-queryable tables.
The problem with this approach arises when I try to create a join table between 2 of the tables, like below:
CREATE TABLE PO_HOLLISTIC_AGGREGATE
AS
SELECT * FROM services_contracts_join_tbl scj
INNER JOIN po_vendors_join_tbl pvj ON scj.c_vendorId = pvj.v_id;
I receive the error here stating that the join needs to be on the PK of the right table, but again, the PK required here would be of the services table because there are multiple.
This makes me conclude that the only way this would be possible with KSQLDB is if I stored the service PK on the other 3 streams, which wouldn't be doable either really because of the aggregation restrictions when creating a table via joined streams.
I'd appreciate any ideas, thanks again!

Related

Best way to join two (or more) kafka topics in KSQL emiting changes from all topics?

We have a "microservices" platform and we are using debezium for change data capture from databases on these platforms which is working nicely.
Now, we'd like to make it easy for us to join these topics and stream the results into a new topic which could be consumed by multiple services.
Disclaimer: this assumes v0.11 ksqldb and cli (seems like much of this might not work in older versions)
Example of two tables from two database instances that stream into Kafka topics:
-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
id varchar(36) NOT NULL,
first_name varchar(255) NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
-- source organization microservice (postgres)
CREATE TABLE public.user_info (
id varchar(36) NOT NULL,
user_entity_id varchar(36) NOT NULL,
business_unit varchar(255) NOT NULL,
cost_center varchar(255) NOT NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
Option 1 : Streams
CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id
EMIT CHANGES;
Notice WITHIN 365 DAYS, conceptually these tables could go a very long time without being changed so this window would be technically infinitely large. This looks fishy and seems to hint that this is not a good way to do this.
Option 2 : Tables
CREATE TABLE ktable_user_info_by_user_entity_id (
user_entity_id,
first_name,
business_unit,
cost_center
)
with (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
We no longer need the window WITHIN 365 DAYS, so this feels more correct. However this only emits a change when a message is sent to the stream not the table.
In this example:
User updates first_name -> change is emitted
User updates business_unit -> no change emitted
Perhaps there might be a way to create a merged stream partitioned by the user_entity_id and join to child tables which would hold the current state, which leads me to....
Option 3 : Merged stream and tables
-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;
CREATE TABLE ktable_user_entity_by_id (
id VARCHAR PRIMARY KEY,
first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
This one looks the best, but appears to have a lot of moving components for each table we have 2 streams, 1 insert query, 1 ktable. Another potential issue here might be a hidden race condition where the stream emits the change before the table is updated under the covers.
Option 4 : More merged tables and streams
CREATE STREAM stream_user_entity_changes_enriched
AS SELECT
ue.id AS user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
CREATE STREAM stream_user_info_changes_enriched
AS SELECT
ui.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;
CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;
This is conceptually the same as the earlier one, but the "merging" happens after the joins. Conceivably, this might eliminate any potential race condition because we're selecting primarily from the streams instead of the tables.
The downside is that the complexity is even worse than option 3 and writing and tracking all these streams for any joins with more than two tables would be kind of mind numbing...
Question :
What method is the best for this use case and/or are we attempting to do something that ksql shouldn't be used for? Would we better off to just offload this to traditional RDBMS or spark alternatives?
I'm going to attempt to answer my own question, only accept if upvoted.
The answer is: Option 3
Here are the reasons for this use case this would be the best, while perhaps could be subjective
Streams partitioned by primary and foreign keys are common and simple.
Tables based on these streams are common and simple.
Tables used in this way will not be a race condition.
All options have merits, e.g. if you don't care about emitting all the changes or if the data behaves like streams (logs or events) instead of slow changing dimensions (sql tables).
As for "race conditions", the word "table" tricks your mind that you are actually processing and persisting data. In reality is that they are not actually physical tables, they actually behave more like sub-queries on streams. Note: It might be an exception for aggregation tables which actually produce topics (which I would suggest is a different topic, but would love to see comments)
In the end (syntax may have some slight bugs):
---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------
-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
-- shared keyed streams (i like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;
-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
-- pretty simple looking query
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
The "shared" objects are basically the streaming schema (temptation is to create for all our topics, but that's another question) and the second portion is like the query schema. It ultimately is a functional, clean, and repeatable pattern.
i like your approach number 3. indeed i have tried to use that one to merge streams with different primary keys into one master stream and then grouping them in a materialized view.
the join seems to work, but i ended up having the same situation as in a regular stream-table join ... indeed i see changes in the master stream, but somehow those changes are only triggered downstream (to the group by table) when they affect the first table in the table-stream join and not the others.
So basically what i have achieved is the following:
debezium --> create 3 streams: A,B,AB (AB is a matching table between ids in A and ids in B, used in postgres to make an n-to-n join)
stream A,B,C are repartitioned by one id (A_id) and merged into one stream. in this step all elements of B get assigned a virtual A_id since it is not relevant to them
the 3 KTables are created (i still keep wondering, why? is this a sort of self-join?)
a materialized view (table) is created by grouping the master stream after the joining it to the 3 KTables
when a change in A,B, or AB happens, i see changes in the master stream too, but the materialized view is only updated when changes on stream B occur. Of course destroying the table and recreating it makes it "up-to-date".
are you facing the same problem?

Strategy for selecting records for DataSet

In Most common cases, we have two tables (& more) in DB termed as master (e.g. SalesOrderHeader) & chirld (e.g. SalesOrderDetail).
We can read records from DB by one Select with INNER JOIN and additional constaints WHERE for lessen volume data for loading from DB (using "Addater.Fill(DataSet)")
#"SELECT d.SalesOrderID, d.SalesOrderDetailID, d.OrderQty,
d.ProductID, d.UnitPrice
FROM Sales.SalesOrderDetail d
INNER JOIN Sales.SalesOrderHeader h
ON d.SalesOrderID = h.SalesOrderID
WHERE DATEPART(YEAR, OrderDate) = #year;"
Did I understand right, in this case we receive one table in DataSet, w/o primary and foreign keys, and w/o possibility to set constraint between master and child tables.
This Dataset can be useful only for different queries regarding columns and record which exist in DataSet?
We can't using DbCommandBuilder for creating SQLCommands for Insert, Update, Delete based on the SelectCommand which was used for filling DataSet? And simply to Update data in these table in DB?
If we want to organize the local data moddification in tables by using the disconnect layer of ADO.NET, we must populate DataSet by two Select
"SELECT *
FROM Sales.SalesOrderHeader;"
"SELECT *
FROM Sales.SalesOrderDetail;"
After that we must create the primary keys for both table, and set constraint between master and child table. Create by DbCommandBuilder SQLCommands for Insert, Update, Delete.
In that case we will have possibility to any modification data in these tables remotely and after Update records in DB (using "Addater.Update(DataSet)").
If we will use one SelectCommand to load data in two tables in DataSet, can we use that SelectCommand for DbCommandBuilder for creating other SQLCommands for "Update" and Update all tables in DataSet by one "Addater.Update(DataSet)" or we must create separate Addapter for Update every table?
If I for economy resources will load only part of records (see below) from table (e.g. SalesOrderDetail). Do I right understand, that in this case, I can have a possible problems, when I will send new records to DB (by Update), because news records can conflict with existen in DB by primary key (some records which have other value in OrderDate field)?
"SELECT *
FROM Sales.SalesOrderDetail
WHERE DATEPART(YEAR, OrderDate) = #year;"
There is nothing preventing you from writing your own Insert, Update and Delete commands for your first select statement with the join. Of course you will have to determine a way to assure that the foreign keys exist.
Insert Into SalesOrderDetail (SalesOrderID, OrderQty, ProductID, UnitPrice) Values ( #SalesOrderID, #OrderQty, #ProductID, #UnitPrice);
Update SalesOrderDetail Set OrderQty = #OrderQty Where SalesOrderDetailID = #ID;
Delete From SalesOrderDetail Where SalesOrderDetailID = #ID;
You would execute these with ADO.net commands instead of using the adapter. I wrote the sample code in vb.net but I am sure it is easy to change to C# if you prefer.
Private Sub UpdateQuantity(Quant As Integer, DetailID As Integer)
Using cn As New SqlConnection("Your connection string"),
cmd As New SqlCommand("Update SalesOrderDetail Set OrderQty = #OrderQty Where SalesOrderDetailID = #ID;")
cmd.Parameters.Add("#OrderQty", SqlDbType.Int).Value = Quant
cmd.Parameters.Add("#ID", SqlDbType.Int).Value = DetailID
cn.Open()
cmd.ExecuteNonQuery()
End Using
End Sub

Postgresql get references from a dictionary

I'm trying to build a request to get the data from a table, but some of those columns have foreign keys I would like to replace by the associated keyword in one request.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a request with a subrequest + alias to get that, but ofc I have more than one result per subrequest, yet I can't find a way to link that subrequest to the ID of table B [might be exhausted, dumb or both] from the main request. I did something like that:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
it feels so simple of a request yet...
You don't need a join in the subselect.
SELECT pkb_id,
(SELECT name FROM a WHERE a.pka_id = b.fka_id),
amount
FROM b;
(See it live in SQL Fiddle).
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b, a
WHERE a.pka_id = b.fka_id;
Note that the join version puts less restrictions on the PostgreSQL query optimizer so in some cases the join version might work faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU units, cf. Parallel Query).

How to optimise tables in Netezza to compliment a join with date conditions

I have two tables that I need to join in Netezza and one of them is very large
I have a dimension table that is a customer table which has two fields, customer id and an observation date i.e.
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has a lot of transactions per customer per date i.e.
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed by the same key - cust_id
However When I join the tables, i need to join given the date condition. The query is very fast when i just join them together, but when I add the date condition it does not seem optimised. Does anyone have tips on how to set up the underlying tables or write the join?
I.e. sum transaction_amt for each customer for all their transactions for the 3 months up to their obs_date
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30 AND CUSTOMER_TABLE.OBS_DATE
If your transaction table is sufficiently large, it may benefit from using CBTs.
If you can, create a copy of the table that uses TRAN_DATE to organize (I'm guessing at your ddl here):
create table transaction_table (
cust_id varchar(20)
,tran_date date
,transaction_amt numeric(10,0)
) distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments below, you should either sort by the date on insert or groom the records after to make sure that they are sorted appropriately.

optimizing a slow postgresql query against multiple tables

One of our PostgreSQL queries started getting slow (~15 seconds) so we looked at migrating to a Graph database. Early tests show significantly faster speeds, so AWESOME.
Here's the problem- we still need to store a backup of the data in Postgres for non-analytics needs. The Graph database is just for analytics, and we'd prefer for it to remain a secondary data store. Because our business logic changed quite a bit during this migration, two existing tables turned into 4 -- and the current 'backup' selects in Postgres take anywhere from 1 to 6 minutes to run.
I've tried a few ways to optimize this, and the best seems to be turning this into two queries. If anyone can suggest obvious mistakes here , I'd love to hear a suggestion. I've tried switching up left/right/inner joins with little difference in the query planner. The join order does affect a difference ; I think I'm just not getting this correctly.
I'll go into details.
Goal : Retrieve the last 10 attachments sent to a given person
Database Structure :
CREATE TABLE message (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE attachments (
id SERIAL PRIMARY KEY NOT NULL ,
body_raw TEXT
);
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
);
CREATE TABLE mailings (
id SERIAL PRIMARY KEY NOT NULL ,
event_timestamp TIMESTAMP not null ,
recipient_id INT NOT NULL ,
message_id INT NOT NULL REFERENCES message(id)
);
sidenote: the reason why a mailing is abstracted from the message is that a mailing often has more than one recipient /and/ a single message can go out to multiple recipients
This query takes about 5 minutes on a relatively small dataset (query planner time is the comment above each item ) :
-- 159374.75
EXPLAIN ANALYZE SELECT attachments.*
FROM attachments
JOIN message_2_attachments ON attachments.id = message_2_attachments.attachment_id
JOIN message ON message_2_attachments.message_id = message.id
JOIN mailings ON mailings.message_id = message.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
Splitting it up into 2 queries only takes 1/8 the time :
-- 19123.22
EXPLAIN ANALYZE SELECT message_2_attachments.attachment_id
FROM mailings
JOIN message ON mailings.message_id = message.id
JOIN message_2_attachments ON message.id = message_2_attachments.message_id
JOIN attachments ON message_2_attachments.attachment_id = attachments.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
-- 1.089
EXPLAIN ANALYZE SELECT * FROM attachments WHERE id IN ( results of above query )
I've tried re-writing the queries a handful of times -- different join orders, different types of joins, etc. I can't seem to make this anywhere nearly as efficient in a single query as it can be in two.
UPDATED Github has better formatting, so the full output of explain is here - https://gist.github.com/jvanasco/bc1dd38ca06e52c9a090
Plugged in the output of your explain here : http://explain.depesz.com/s/hqPT
As you can see, the :
Hash Join (cost=96588.85..158413.71 rows=44473 width=3201) (actual time=22590.630..30761.213 rows=44292 loops=1)
Hash Cond: (message_2_attachment.attachment_id = attachment.id)
is taking a good amount of time. I'd try to add indexes to the foreign keys as well with :
CREATE INDEX idx_message_2_attachments_attachment_id ON "message_2_attachments" USING btree (attachment_id);
CREATE INDEX idx_message_2_attachments_message_id ON "message_2_attachments" USING btree (message_id);`
CREATE INDEX idx_mailings_message_id ON "mailings" USING btree (message_id);
The junction table is missing a primary key. Also it is advisable to add a reversed index on this PK:
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
, PRIMARY KEY (message_id,attachment_id) -- <<== here
);
CREATE UNIQUE INDEX ON message_2_attachments(attachment_id,message_id); -- <<== here
For the mailings table, the situation is not so clear. It looks like some combination of {event_timestamp, recipient_id, message_id} could function as a candidate key. The id field merely functions as a surrogate.