How to correctly GROUP BY on JDBC sources

I have a Kafka stream with user_id and want to produce another stream with user_id and the number of matching records in a JDBC table.
Below is how I tried to achieve this (I'm new to Flink, so please correct me if that's not how things are supposed to be done). The issue is that Flink ignores all updates to the JDBC table after the job has started.
As far as I understand, the answer to this is to use lookup joins, but Flink complains that lookup joins are not supported on temporal views. I also tried versioned views without much success.
What would be the correct approach to achieve what I want?
CREATE TABLE kafka_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE jdbc_table (
user_id STRING,
checked_at TIMESTAMP,
PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
-- ...
)
-- NEXT SQL --
CREATE TEMPORARY VIEW checks_counts AS
SELECT user_id, count(*) as num_checks
FROM jdbc_table
GROUP BY user_id
-- NEXT SQL --
INSERT INTO output_kafka_stream
SELECT
kafka_stream.user_id,
checks_counts.num_checks
FROM kafka_stream
LEFT JOIN checks_counts ON kafka_stream.user_id = checks_counts.user_id
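For reference, here is a minimal sketch (not from the original post) of the lookup-join syntax the question refers to. It assumes kafka_stream also declares a processing-time attribute (e.g. proc_time AS PROCTIME()), and it assumes a hypothetical database-side view, checks_counts_view, that exposes the per-user count as a column, since a lookup join reads the current row for a key from the connector table rather than from a Flink temporary view:
CREATE TABLE jdbc_checks_counts (
user_id STRING,
num_checks BIGINT,
PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'table-name' = 'checks_counts_view', -- hypothetical DB view doing the GROUP BY
-- ...
)
-- NEXT SQL --
INSERT INTO output_kafka_stream
SELECT k.user_id, c.num_checks
FROM kafka_stream AS k
LEFT JOIN jdbc_checks_counts FOR SYSTEM_TIME AS OF k.proc_time AS c
ON k.user_id = c.user_id
This keeps the aggregation on the database side, and the JDBC lookup re-reads the current row per key at processing time instead of scanning the table once at job start.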

Related

Postgresql - how can I clone a set of records but maintain a mapping between the original ids and the new ids

I come from a SQL Server background and our team is migrating to Postgres (version 9.5).
We have a number of scripts that perform MERGE statements that essentially 'clone' rows in a table and insert them back into the same table with a new Id while maintaining a map between the cloned records and the records they were cloned from.
I'm having a hard time trying to replicate this behavior. I've tried a number of variations, but I still can't seem to find the right combination of temp tables and CTEs to get it right.
Here's an approximation of the latest version that doesn't work:
CREATE SCHEMA stackoverflow;
CREATE TABLE stackoverflow.clone_problem
(
id bigserial PRIMARY KEY NOT NULL,
some_id bigint NULL,
some_other_id bigint NULL,
modified_time timestamp NOT NULL DEFAULT now(),
modified_by varchar(128) NOT NULL DEFAULT current_user
);
INSERT INTO stackoverflow.clone_problem
(
id,
some_id,
some_other_id
)
VALUES (1,1,1)
,(2,2,2)
,(3,3,3);
;WITH sources
AS
(
SELECT
id as old_id,
some_id,
some_other_id
FROM stackoverflow.clone_problem
WHERE id = ANY('{1,3}')
),
inserts
AS
(
INSERT INTO stackoverflow.clone_problem
(
some_id,
some_other_id
)
SELECT
s.some_id,
s.some_other_id
FROM sources s
RETURNING id as new_id, s.id as old_id -- this doesn't work
)
SELECT * from inserts;
The final select statement is the output I'm trying to capture -- either from a RETURNING statement or by other means -- so we know which records were cloned and what their new Ids are. But the code above throws this error: error: missing FROM-clause entry for table "s".
I don't understand, because s is in the FROM clause, so the error seems counterintuitive to me. I'm sure I'm missing something dumb, but I just can't figure out how to get that final piece of information.
Any help would be greatly appreciated.
I think your only chance is to generate the ID before you do the insert, so that you have the mapping between old and new ID right away. This can be done by calling nextval() when retrieving the source rows, then providing that already-generated ID during the INSERT:
with sources as (
SELECT id as old_id,
nextval(pg_get_serial_sequence('stackoverflow.clone_problem', 'id')) as new_id,
some_id,
some_other_id
FROM stackoverflow.clone_problem
WHERE id IN (1,3)
), inserts as (
INSERT INTO stackoverflow.clone_problem (id, some_id, some_other_id)
SELECT s.new_id,
s.some_id,
s.some_other_id
FROM sources s
)
select old_id, new_id
from sources;
By using pg_get_serial_sequence you don't need to know the name of the sequence directly.
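One caveat worth adding (a note on top of the original answer): the sample data above inserts explicit id values, so the bigserial sequence has not advanced past them, and nextval() could hand out ids that already exist. A small guard before relying on the approach:
SELECT setval(pg_get_serial_sequence('stackoverflow.clone_problem', 'id'),
              (SELECT max(id) FROM stackoverflow.clone_problem));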

postgres: insert rows into a table with multiple records from other joined tables

I am trying to insert multiple records, obtained from a join, into another table, user_to_property. In the user_to_property table, user_to_property_id is the primary key, not null, and it is not auto-incrementing. So I am trying to add user_to_property_id manually, incrementing it by 1.
WITH selectedData AS
( -- selection of the data that needs to be inserted
SELECT t2.user_id as userId
FROM property_lines t1
INNER JOIN user t2 ON t1.account_id = t2.account_id
)
INSERT INTO user_to_property (user_to_property_id, user_id, property_id, created_date)
VALUES ((SELECT MAX( user_to_property_id )+1 FROM user_to_property),(SELECT
selectedData.userId
FROM selectedData),3,now());
The above query gives me the below error:
ERROR: more than one row returned by a subquery used as an expression
How do I insert multiple records into a table from the join of other tables? The user_to_property table should contain a unique record per user_id and property_id; for the same user_id and property_id there should be only one record.
Typically for INSERT you use either VALUES or SELECT. The structure VALUES (SELECT ...) often (generally?) just causes more trouble than it is worth, and it is never necessary: you can always select a constant or an expression. In this case, convert it to just a SELECT. For generating your IDs, get the max value from your table and then add the row_number() of each row you are inserting:
insert into user_to_property(user_to_property_id
, user_id
, property_id
, created_date
)
with start_with(current_max_id) as
( select max(user_to_property_id) from user_to_property )
select current_max_id + id_incr, user_id, 3, now()
from (
select t2.user_id, row_number() over() id_incr
from property_lines t1
join users t2 on t1.account_id = t2.account_id
) js
join start_with on true;
A couple of notes:
DO NOT use user as a table name, or as any other object name. It is a documented reserved word in both Postgres and the SQL standard (and has been since Postgres v7.1 and the SQL-92 standard at least).
You really should create another column, or change the user_to_property_id column to be auto-generated (see the sketch below). Using max()+1, or anything based on that idea, is a virtual guarantee that you will eventually generate duplicate keys, much to the amusement of users and developers alike: consider what happens under MVCC when two users run the query concurrently.
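A minimal sketch of that change, assuming PostgreSQL 10 or later and the table and column names from the question (an illustration, not part of the original answer):
-- switch the key to an auto-generated identity column
ALTER TABLE user_to_property
    ALTER COLUMN user_to_property_id ADD GENERATED BY DEFAULT AS IDENTITY;
-- move the backing sequence past keys that already exist (assumes the table has rows)
SELECT setval(pg_get_serial_sequence('user_to_property', 'user_to_property_id'),
              (SELECT max(user_to_property_id) FROM user_to_property));
After that, inserts can simply omit user_to_property_id and let the identity column assign keys safely under concurrency.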

Best way to join two (or more) Kafka topics in KSQL, emitting changes from all topics?

We have a "microservices" platform and we are using debezium for change data capture from databases on these platforms which is working nicely.
Now, we'd like to make it easy for us to join these topics and stream the results into a new topic which could be consumed by multiple services.
Disclaimer: this assumes v0.11 ksqldb and cli (seems like much of this might not work in older versions)
Example of two tables from two database instances that stream into Kafka topics:
-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
id varchar(36) NOT NULL,
first_name varchar(255) NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
-- source organization microservice (postgres)
CREATE TABLE public.user_info (
id varchar(36) NOT NULL,
user_entity_id varchar(36) NOT NULL,
business_unit varchar(255) NOT NULL,
cost_center varchar(255) NOT NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
Option 1 : Streams
CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id
EMIT CHANGES;
Notice the WITHIN 365 DAYS: conceptually these tables could go a very long time without being changed, so this window would need to be effectively infinite. This looks fishy and seems to hint that this is not a good way to do it.
Option 2 : Tables
CREATE TABLE ktable_user_info_by_user_entity_id (
user_entity_id VARCHAR PRIMARY KEY,
id VARCHAR,
business_unit VARCHAR,
cost_center VARCHAR
)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
We no longer need the WITHIN 365 DAYS window, so this feels more correct. However, this only emits a change when a message arrives on the stream, not on the table.
In this example:
User updates first_name -> change is emitted
User updates business_unit -> no change emitted
Perhaps there might be a way to create a merged stream partitioned by the user_entity_id and join to child tables which would hold the current state, which leads me to....
Option 3 : Merged stream and tables
-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;
CREATE TABLE ktable_user_entity_by_id (
id VARCHAR PRIMARY KEY,
first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
This one looks the best, but appears to have a lot of moving parts: for each table we have 2 streams, 1 insert query, and 1 ktable. Another potential issue here might be a hidden race condition where the stream emits the change before the table is updated under the covers.
Option 4 : More merged tables and streams
CREATE STREAM stream_user_entity_changes_enriched
AS SELECT
ue.id AS user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
CREATE STREAM stream_user_info_changes_enriched
AS SELECT
ui.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;
CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;
This is conceptually the same as the earlier one, but the "merging" happens after the joins. Conceivably, this might eliminate any potential race condition because we're selecting primarily from the streams instead of the tables.
The downside is that the complexity is even worse than option 3, and writing and tracking all these streams for any join with more than two tables would be kind of mind-numbing...
Question :
Which method is best for this use case, and/or are we attempting to do something that ksqlDB shouldn't be used for? Would we be better off just offloading this to a traditional RDBMS or to Spark alternatives?
I'm going to attempt to answer my own question, only accept if upvoted.
The answer is: Option 3
Here are the reasons why, for this use case, this would be the best, though that is perhaps subjective:
Streams partitioned by primary and foreign keys are common and simple.
Tables based on these streams are common and simple.
Tables used in this way will not create a race condition.
All options have merits, e.g. if you don't care about emitting all the changes, or if the data behaves like streams (logs or events) instead of slowly changing dimensions (SQL tables).
As for "race conditions", the word "table" tricks your mind into thinking that you are actually processing and persisting data. In reality they are not physical tables; they behave more like sub-queries over streams. Note: aggregation tables, which actually produce topics, might be an exception (which I would suggest is a different topic, but I would love to see comments).
In the end (syntax may have some slight bugs):
---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------
-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
-- shared keyed streams (i like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;
-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
-- pretty simple looking query
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
The "shared" objects are basically the streaming schema (temptation is to create for all our topics, but that's another question) and the second portion is like the query schema. It ultimately is a functional, clean, and repeatable pattern.
I like your approach number 3. Indeed, I have tried to use that one to merge streams with different primary keys into one master stream and then group them in a materialized view.
The join seems to work, but I ended up in the same situation as with a regular stream-table join: I do see changes in the master stream, but somehow those changes are only propagated downstream (to the GROUP BY table) when they affect the first table in the stream-table join, and not the others.
So basically what I have achieved is the following:
Debezium --> create 3 streams: A, B, AB (AB is a matching table between ids in A and ids in B, used in Postgres to make an n-to-n join)
Streams A, B and AB are repartitioned by one id (A_id) and merged into one stream. In this step all elements of B get assigned a virtual A_id, since it is not relevant to them.
The 3 KTables are created (I still keep wondering why; is this a sort of self-join?)
A materialized view (table) is created by grouping the master stream after joining it to the 3 KTables.
When a change in A, B, or AB happens, I see changes in the master stream too, but the materialized view is only updated when changes on stream B occur. Of course, destroying the table and recreating it makes it "up-to-date".
Are you facing the same problem?

Will I get the benefits of a hypertable if I have a query in which I join a hypertable with a normal (non-hyper) table in TimescaleDB?

I have to fetch records from two tables; one is a hypertable and the other is a normal table.
The hypertable's primary key (a UUID, not a timestamptz column) is used as a foreign key in the second, normal table.
The hypertable has a one-to-many relationship with the normal table.
Will I get all the benefits of the hypertable here if I select records after joining these tables?
I am using a PostgreSQL database with TimescaleDB.
Below are the CREATE TABLE statements for both. demography_person is the hypertable and emotions_person is the normal table:
CREATE TABLE public.demography_person
(
start_timestamp timestamp with time zone NOT NULL,
end_timestamp timestamp with time zone,
demography_person_id character varying NOT NULL,
device_id bigint,
age_actual numeric,
age_band integer,
gender integer,
dwell_time_in_millis bigint,
customer_id bigint NOT NULL
);
SELECT create_hypertable('demography_person', 'start_timestamp');
CREATE TABLE public.emotions_person
(
emotion_start_timestamp timestamp with time zone NOT NULL,
demography_person_id character varying NOT NULL,
count integer,
emotion integer,
emotion_percentage numeric
);
The SELECT query looks like this:
SELECT * FROM crosstab
(
$$
SELECT * FROM ( select to_char(dur,'HH24') as duration , dur as time_for_sorting from
generate_series(
timestamp '2019-04-01 00:00:00',
timestamp '2020-03-09 23:59:59' ,
interval '1 hour'
) as dur ) d
LEFT JOIN (
select to_char(
start_timestamp ,
'HH24'
)
as duration,
emotion,count(*) as count from demography_person dp INNER JOIN (
select distinct ON (demography_person_id) demography_person_id, emotion_start_timestamp,count,emotion,emotion_percentage,
(CASE emotion when 4 THEN 1 when 6 THEN 2 when 1 THEN 3 WHEN 3 THEN 4 WHEN 2 THEN 5 when 7 THEN 6 when 5 THEN 7 ELSE 8 END )
as emotion_key_for_sorting from emotions_person where demography_person_id in (select demography_person_id from demography_person where start_timestamp >= '2019-04-01 00:00:00'
AND start_timestamp <= '2020-03-09 23:59:59' AND device_id IN ( 2052,2692,1797,2695,1928,2697,2698,1931,2574,2575,2706,1942,1944,2713,1821,2719,2720,2721,2722,2723,2596,2725,2217,2603,1852,2750,1726,1727,2754,2757,1990,2759,2760,2376,2761,2762,2257,2777,2394,2651,2652,1761,2658,1762,2659,2788,2022,2791,2666,1770,2026,2028,2797,2675,1780,2549 ))
order by demography_person_id asc,emotion_percentage desc, emotion_key_for_sorting asc
) ep ON
ep.demography_person_id = dp.demography_person_id
WHERE start_timestamp >= '2019-04-01 00:00:00'
AND start_timestamp <= '2020-03-09 23:59:59' AND device_id IN ( 2052,2692,1797,2695,1928,2697,2698,1931,2574,2575,2706,1942,1944,2713,1821,2719,2720,2721,2722,2723,2596,2725,2217,2603,1852,2750,1726,1727,2754,2757,1990,2759,2760,2376,2761,2762,2257,2777,2394,2651,2652,1761,2658,1762,2659,2788,2022,2791,2666,1770,2026,2028,2797,2675,1780,2549 ) AND gender IN ( 1,2 )
group by 1,2 ORDER BY 1,2 ASC
) t USING (duration) GROUP BY 1,2,3,4 ORDER BY time_for_sorting;
$$ ,
$$
select emotion from (
values ('1'), ('2'), ('3'),('4'), ('5'), ('6'),('7'), ('8')
) t(emotion)
$$
) AS ct
(
duration text,
time_for_sorting timestamp,
ANGER bigInt,
DISGUSTING bigInt,
FEAR bigInt,
HAPPY bigInt,
NEUTRAL bigInt,
SAD bigInt,
SURPRISE bigInt,
NO_DETECTION bigInt
);
Will I get the benefits of a hypertable if I have a query in which I join a hypertable with a normal (non-hyper) table in TimescaleDB?
I don't fully understand the question and see two interpretations of it:
Will I benefit from using TimescaleDB and a hypertable just for improving this query?
Can I join a hypertable and a normal table, and how can I make the above query perform better?
If you just need to execute a complex query over a large dataset, PostgreSQL can do a good job if you provide indexes. TimescaleDB provides benefits for timeseries workflows, especially when a workflow includes in-order data ingestion, time-related queries, timeseries operators, and/or TimescaleDB-specific functionality such as continuous aggregates and compression, i.e., not just a single query. TimescaleDB is designed for large volumes of timeseries data. I hope this clarifies the first question.
In TimescaleDB it is very common to join a hypertable, which stores the timeseries data, with a normal table, which contains metadata about the timeseries data. TimescaleDB implements constraint exclusion to improve query performance. However, it might not be applied in some cases due to uncommon query expressions or overly complex queries.
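As a first step, a sketch of indexes that would typically help this particular join (suggested additions, not part of the original schema):
-- supports the join from emotions_person back to the hypertable key
CREATE INDEX IF NOT EXISTS idx_emotions_person_dpid
    ON emotions_person (demography_person_id);
-- supports the device/time filter; TimescaleDB already indexes
-- start_timestamp by default, this adds the device_id dimension
CREATE INDEX IF NOT EXISTS idx_demography_person_device_time
    ON demography_person (device_id, start_timestamp DESC);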
The query in the question is very complex, so I suggest running EXPLAIN ANALYZE on it to see whether the query planner misses some optimisations.
I see that the query generates data (the generate_series of hourly buckets), and I doubt much can be done to produce a good query plan for that part, so this is my biggest concern for getting good performance. It would be great if you could explain the motivation for generating data inside the query.
Another issue I see is the nested query demography_person_id in (select demography_person_id from demography_person ... in the WHERE condition, while the outer query participates in an inner join with the same table as the nested query. I expect it can be rewritten without the nested subquery, using the inner join alone (see the sketch below).
I doubt that TimescaleDB or PostgreSQL can do much to execute the query efficiently as written; it requires manual rewriting by a human.
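A rough sketch of that rewrite (simplified for illustration: it drops the DISTINCT ON / emotion ordering, shortens the device list, and only shows how the IN (...) subquery can become a plain join):
SELECT ep.demography_person_id, ep.emotion, ep.emotion_percentage
FROM demography_person dp
JOIN emotions_person ep
    ON ep.demography_person_id = dp.demography_person_id
WHERE dp.start_timestamp >= '2019-04-01 00:00:00'
  AND dp.start_timestamp <= '2020-03-09 23:59:59'
  AND dp.device_id IN (2052, 2692, 1797)  -- shortened device list for illustration
  AND dp.gender IN (1, 2);
This way the time and device filters are applied to the hypertable once and the join is driven from that filtered set, instead of scanning demography_person both in the outer query and in the subquery.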

How to select from explicit partition in PostgreSQL

In Oracle and MySQL I can select from a specific partition:
SELECT ... FROM ... PARTITION (...)
In SQL Server the syntax is a bit different, involving the partition function.
Is there a way to do this in PostgreSQL?
thank you!
PostgreSQL has historically provided partitioning through table inheritance (newer versions also offer declarative partitioning; in both cases partitions can be queried directly by name).
Partitions are child tables with a unique name, like any other table, so you just select from them directly by name. The only special case is the parent table: to select data from the parent table while ignoring the child tables, the additional keyword ONLY is used, as in SELECT * FROM ONLY parent_table.
Example from the manual:
CREATE TABLE measurement_y2006m02 (
CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);
So select * from measurement_y2006m02 would get data from only this partition.
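For completeness, a minimal pair of examples using the measurement tables from the manual's inheritance example:
-- read a single partition directly by its name
SELECT * FROM measurement_y2006m02;
-- read only rows stored in the parent itself, ignoring all child partitions
SELECT * FROM ONLY measurement;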