I'm trying to use KSQL (as part of confluent-5.0.0) to create a single record out of a set of parent and child records, where every parent record has multiple child records (specifically, payment details and the parties involved in the payment). These parent/child records are linked by the parent's id. To illustrate, I'm dealing with records of roughly this structure in the source system:
payment:
| id | currency | amount | payment_date |
|------------------------------------------|
| pmt01 | USD | 20000 | 2018-11-20 |
| pmt02 | USD | 13000 | 2018-11-23 |
payment_parties:
| id | payment_id | party_type | party_ident | party_account |
|-----------------------------------------------------------------|
| prt01 | pmt01 | sender | XXYYZZ23 | (null) |
| prt02 | pmt01 | intermediary | AADDEE98 | 123456789 |
| prt03 | pmt01 | receiver | FFGGHH56 | 987654321 |
| prt04 | pmt02 | sender | XXYYZZ23 | (null) |
| prt05 | pmt02 | intermediary | (null) | (null) |
| prt06 | pmt02 | receiver | FFGGHH56 | 987654321 |
These records are loaded, in AVRO format, onto a set of Kafka topics using Oracle GoldenGate, with one topic per table. This means the following topics exist: src_payment and src_payment_parties. Due to the way the source system works, the timestamps of these records fall within a few milliseconds of each other.
Now, the intent is to 'flatten' these records into a single record, which will be consumed from an outgoing topic. To illustrate, for the records above, the desired output would be along these lines:
payment_flattened:
| id | currency | amount | payment_date | sender_ident | sender_account | intermediary_ident | intermediary_account | receiver_ident | receiver_account |
|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| pmt01 | USD | 20000 | 2018-11-20 | XXYYZZ23 | (null) | AADDEE98 | 123456789 | FFGGHH56 | 987654321 |
| pmt02 | USD | 13000 | 2018-11-23 | XXYYZZ23 | (null) | (null) | (null) | FFGGHH56 | 987654321 |
The first question I'd like to ask here, is the following: How can I best achieve this combination of data from the source topics?
Of course, I have tried some things myself. In the interest of brevity, I'll describe what I tried in order to append the first of the payment parties to the payment records.
Step one: set up the source streams
Note: due to the OGG setup adding a property called 'table' to the AVRO schema, I have to specify the fields to take from the topic. Additionally, I'm not interested in the fields specifying the type of operation (e.g. insert or update).
create stream payment_stream (id varchar, currency varchar, amount double, \
payment_date varchar) with (kafka_topic='src_payment',value_format='avro');
create stream payment_parties_stream (id varchar, payment_id varchar, party_type varchar, \
party_ident varchar, party_account varchar) with (kafka_topic='src_payment_parties',\
value_format='avro');
Step two: create stream for the payment senders
Note: from what I've gathered from the documentation and from experimenting, in order to join the payment stream to a payment party stream, the latter needs to be partitioned by the payment id. Additionally, the only way I have gotten the join to work is by renaming the column.
create stream payment_sender_stream as select payment_id as id, party_ident, \
party_account from payment_parties_stream where party_type = 'sender' partition by id;
Step three: join two streams
Note: I'm using a left join because not all parties are present for every payment, as in the example records above, where pmt02 does not have an intermediary.
create stream payment_with_sender as select pmt.id as id, pmt.currency, pmt.amount, \
pmt.payment_date, snd.party_ident, snd.party_account from payment_stream pmt left join \
payment_sender_stream snd within 1 seconds on pmt.id = snd.id;
Now, the output I would expect from this stream is something along these lines:
ksql> select * from payment_with_sender;
rowtime | pmt01 | pmt01 | USD | 20000 | 2018-11-20 | XXYYZZ23 | null
rowtime | pmt02 | pmt02 | USD | 13000 | 2018-11-23 | XXYYZZ23 | null
Instead, the output I'm seeing is along these lines:
ksql> select * from payment_with_sender;
rowtime | pmt01 | pmt01 | USD | 20000 | 2018-11-20 | null | null
rowtime | pmt01 | pmt01 | USD | 20000 | 2018-11-20 | XXYYZZ23 | null
rowtime | pmt02 | pmt02 | USD | 13000 | 2018-11-23 | null | null
rowtime | pmt02 | pmt02 | USD | 13000 | 2018-11-23 | XXYYZZ23 | null
Hence, the second (two-part) question I'd like to ask is: Why does the left join produce these duplicate records? And can this be avoided?
Apologies for the wall of text, I tried to be as complete as possible in the description of the issue. Of course, I'd be happy to add any possible missing information, and answer questions regarding the setup to the best of my knowledge.
You're almost there :-)
WITHIN 1 SECONDS will give you results triggered from both sides of the join.
Instead, try WITHIN (0 SECONDS, 1 SECONDS). Then only records from the right side of the join will be joined to the left, and not vice versa.
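Applied to the join from the question, that would look something like this (the streams are the ones created above):
create stream payment_with_sender as select pmt.id as id, pmt.currency, pmt.amount, \
pmt.payment_date, snd.party_ident, snd.party_account from payment_stream pmt left join \
payment_sender_stream snd within (0 seconds, 1 seconds) on pmt.id = snd.id;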
You can read more about this pattern in the article I wrote here.
BTW if you want to work around the table reserved word issue from OGG, you can set includeTableName to false in the GG config.
Related
I'm trying to fill a table with data to test a system.
I have two tables
User
+----+----------+
| id | name |
+----+----------+
| 1 | Majikaja |
| 2 | User 2 |
| 3 | Markus |
+----+----------+
Goal
+----+----------+---------+
| id | goal | user_id |
+----+----------+---------+
I want to insert into Goal one record for every user, using only their IDs (they have to exist) and some fixed or random value.
I was thinking of something like this:
INSERT INTO Goal (goal, user_id) values ('Fixed value', select u.id from user u)
So it will generate:
Goal
+----+-------------+---------+
| id | goal | user_id |
+----+-------------+---------+
| 1 | Fixed value | 1 |
| 2 | Fixed value | 2 |
| 3 | Fixed value | 3 |
+----+-------------+---------+
I could just write a simple PHP script to achieve it, but I wonder if it is possible using raw SQL only.
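For the record, standard SQL handles this directly with INSERT ... SELECT; the fixed value simply becomes a constant in the select list. A minimal sketch against the tables above:
INSERT INTO Goal (goal, user_id)
SELECT 'Fixed value', u.id
FROM User u;
Depending on the database, User may need to be quoted, since it is often a reserved word.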
I am using libpq to connect to the Postgres server from C++ code. The Postgres server version is 12.10.
My table schema is defined below
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------+----------+-----------+----------+------------+----------+--------------+-------------
event_id | bigint | | not null | | plain | |
event_sec | integer | | not null | | plain | |
event_usec | integer | | not null | | plain | |
event_op | smallint | | not null | | plain | |
rd | bigint | | not null | | plain | |
addr | bigint | | not null | | plain | |
masklen | bigint | | not null | | plain | |
path_id | bigint | | | | plain | |
attribs_tbl_last_id | bigint | | not null | | plain | |
attribs_tbl_next_id | bigint | | not null | | plain | |
bgp_id | bigint | | not null | | plain | |
last_lbl_stk | bytea | | not null | | extended | |
next_lbl_stk | bytea | | not null | | extended | |
last_state | smallint | | | | plain | |
next_state | smallint | | | | plain | |
pkey | integer | | not null | 1654449420 | plain | |
Partition key: LIST (pkey)
Indexes:
"event_pkey" PRIMARY KEY, btree (event_id, pkey)
"event_event_sec_event_usec_idx" btree (event_sec, event_usec)
Partitions: event_spl_1651768781 FOR VALUES IN (1651768781),
event_spl_1652029140 FOR VALUES IN (1652029140),
event_spl_1652633760 FOR VALUES IN (1652633760),
event_spl_1653372439 FOR VALUES IN (1653372439),
event_spl_1653786420 FOR VALUES IN (1653786420),
event_spl_1654449420 FOR VALUES IN (1654449420)
When I execute the following query it takes 1-2 milliseconds to execute.
The time is provided as a parameter to the function executing this query; it contains epoch seconds and microseconds.
SELECT event_id FROM event WHERE (event_sec > time.seconds) OR ((event_sec = time.seconds) AND (event_usec >= time.useconds)) ORDER BY event_sec, event_usec LIMIT 1
This query is executed every 30 seconds on the same client connection (which is persistent for weeks). The process runs for weeks, but sometimes the same query starts taking more than 10 minutes.
If I restart the process, it recreates the connection to the server, and the execution time falls back to 1-2 milliseconds. The issue is intermittent: sometimes it triggers after a week of running, sometimes after 2-3 weeks.
We add a new partition to the table every Sunday and write new data to the new partition.
I don't know why the performance is inconsistent; there are many possibilities we can't distinguish with the info provided. For example, does the plan change when the performance changes, or does the same plan just perform worse?
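If you can catch a slow run, one way to narrow this down is to capture the plan in both states with EXPLAIN and compare (the literals below are placeholders for your time parameter):
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id FROM event
WHERE (event_sec > 1654449420) OR ((event_sec = 1654449420) AND (event_usec >= 0))
ORDER BY event_sec, event_usec LIMIT 1;
If the slow executions show a different plan, or the same plan with far more buffers read, that tells you which case you are in. The auto_explain extension can log plans for slow executions automatically.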
But your query is not written to take maximal advantage of the index. In my hands it can use the index for ordering, but then it still needs to read and individually skip over things that fail the WHERE clause until it finds the first one that passes. And due to partitioning, I think it is even worse than that: it has to do this read-and-skip until it finds the first row that passes in each partition.
You could rewrite it to do a tuple comparison, which can use the index to determine both the order, and where to start:
SELECT event_id FROM event
WHERE (event_sec, event_usec) >= (:seconds, :useconds)
ORDER BY event_sec, event_usec LIMIT 1;
Now this might also degrade, or might not, or maybe will degrade but still be so fast that it doesn't matter.
I have two tables, products and product_attributs. One product can have one or many attributes, which are filled in through a dynamic web form (name and value inputs) added by the user as needed. For example, for a drill the user could decide to add two attributes: color=blue and power=100 watts. Another product could have 3 or more different attributes, and another could have no special attributes at all.
products
| id | name        | identifier | identifier_type | active |
|----|-------------|------------|-----------------|--------|
| 1  | Drill       | AD44       | barcode         | true   |
| 2  | Polisher    | AP211C     | barcode         | true   |
| 3  | Jackhammer  | AJ2133     | barcode         | false  |
| 4  | Screwdriver | AS4778     | RFID            | true   |
product_attributs
| id | name   | value      | product_id |
|----|--------|------------|------------|
| 1  | color  | blue       | 1          |
| 2  | power  | 100 watts  | 1          |
| 3  | size   | 40 cm      | 2          |
| 4  | energy | electrical | 3          |
| 5  | price  | 35€        | 3          |
So attributes can be anything, set dynamically by the user. My need is to generate a CSV report which contains all products with their attributes. Without much experience in SQL, I wrote the following basic query:
SELECT pr.name, pr.identifier_type, pr.identifier, pr.active, att.name, att.value
FROM products as pr
LEFT JOIN product_attributs att ON pr.id = att.product_id
As you know, the result will contain as many rows per product as it has attributes, and this is not ideal for reporting. The ideal would be this:
| name        | identifier_type | identifier | active | name   | value | name  | value |
|-------------|-----------------|------------|--------|--------|-------|-------|-------|
| Drill       | barcode         | AD44       | true   | color  | blue  | power | 100 w |
| Polisher    | barcode         | AP211C     | true   | size   | 40 cm | null  | null  |
| Jackhammer  | barcode         | AJ2133     | false  | energy | elect | price | 35 €  |
| Screwdriver | RFID            | AS4778     | true   | null   | null  | null  | null  |
Here I only showed a maximum of two attributes per product, but there could be more if needed. Well, I did some research and came across pivoting with the crosstab function in Postgres, but the problem is that it requires static values, and this does not match my need.
Thanks a lot for your help, and sorry for duplicates if any.
Thanks Laurenz Albe for your help. array_agg solved my problem. Here is the query, if someone is interested:
SELECT
pr.name, pr.description, pr.identifier_type, pr.identifier,
pr.internal_identifier, pr.active,
ARRAY_TO_STRING(ARRAY_AGG (oa.name || ' = ' || oa.value),', ') attributs
FROM
products pr
LEFT JOIN product_attributs oa ON pr.id = oa.product_id
GROUP BY
pr.name, pr.description, pr.identifier_type, pr.identifier,
pr.internal_identifier, pr.active
ORDER BY
pr.name;
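For what it's worth, on PostgreSQL 9.0 and later the ARRAY_TO_STRING/ARRAY_AGG combination can be collapsed into a single STRING_AGG call; a sketch against the same tables:
SELECT
    pr.name, pr.description, pr.identifier_type, pr.identifier,
    pr.internal_identifier, pr.active,
    STRING_AGG(oa.name || ' = ' || oa.value, ', ') attributs
FROM
    products pr
LEFT JOIN product_attributs oa ON pr.id = oa.product_id
GROUP BY
    pr.name, pr.description, pr.identifier_type, pr.identifier,
    pr.internal_identifier, pr.active
ORDER BY
    pr.name;
With the sample data above, the attributs column for the Drill row would come out as color = blue, power = 100 watts.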
Googling for a definition of a wide column store either returns results for a column-oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables.
The guess I have as to what is different comes from the fact that "multi-dimensional map" is mentioned alongside wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we would call an entity/attribute/value table: an attribute/value table within a particular entity (column).
For Customer information, the first wide column database example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column Family models simply give a row id to the first two models in the picture, for quicker retrieval of information.
Most (if not all) wide-column stores are indeed row-oriented stores, in that all parts of a record are stored together. You can see such a store as a two-dimensional key-value store: the first part of the key is used to distribute the data across servers, and the second part lets you quickly find the data on the target server.
Wide-column stores differ in features and behaviors, but Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
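In CQL terms, the first layout would correspond to a table definition along these lines (a sketch: the table name and column types are assumed, not given in the question):
CREATE TABLE messages (
    id int,
    country text,
    timestamp date,
    message text,
    -- id is the partition key; country and timestamp are clustering keys,
    -- so rows within a partition are sorted by (country, timestamp)
    PRIMARY KEY ((id), country, timestamp)
);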
If you change the primary key (composite of partitioning and clustering key) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp ASC) (id is the partitioning key, timestamp is the clustering key in ascending order), the result would be:
[Key 1]
1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..." | 1:JP,2020-11-01,"b..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
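The second layout would be declared with timestamp as the only clustering key (again a sketch; note that country and message are then regular columns, so two rows with the same id and timestamp would overwrite each other):
CREATE TABLE messages_by_time (
    id int,
    country text,
    timestamp date,
    message text,
    -- only timestamp clusters rows within a partition; ASC is also the default
    PRIMARY KEY ((id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);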
I have two tables, CompanyAddresses & MyCompanyAddresses. (Names changed to protect the guilty).
CompanyAddresses holds a list of default addresses for companies. These records are immutable. The user can change the details of a company address, but those changes are stored in MyCompanyAddresses.
How can I produce a single list of addresses from both tables, excluding records from CompanyAddresses where a corresponding record exists in MyCompanyAddresses?
Sample Data
CompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
1 | Guid1 | APL | Apple | 1 Infinite Loop | Cupertino | 95014 | 11/1/2012
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
MyCompanyAddresses
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
Desired Results
DatabaseId | Id | Code | Name | Street | City | Zip | Maint Date
2 | Guid2 | MS | Microsoft | One Microsoft Way | Redmond | 98052 | 11/1/2012
5 | Guid3 | APL | Apple | Updated Address | Cupertino | 95014 | 11/6/2012
I've tried various permutations of MS SQL's UNION, EXCEPT & INTERSECT to no avail. Also, I don't believe JOINs are the answer either, but I'll be happily proven wrong.
The database design can be changed, but it would be preferable if it stayed the same.
Use a LEFT JOIN in combination with COALESCE. If the JOIN finds a match, COALESCE selects the values from the overriding row in MyCompanyAddresses; if no match is found, the original values are returned.
SELECT COALESCE(mca.DatabaseId, ca.DatabaseId) AS DatabaseId,
       COALESCE(mca.Id, ca.Id) AS Id,
       ca.Code,
       COALESCE(mca.Name, ca.Name) AS Name,
       COALESCE(mca.Street, ca.Street) AS Street,
       COALESCE(mca.City, ca.City) AS City,
       COALESCE(mca.Zip, ca.Zip) AS Zip,
       COALESCE(mca.MaintDate, ca.MaintDate) AS MaintDate
FROM CompanyAddresses ca
LEFT JOIN MyCompanyAddresses mca
    ON ca.Code = mca.Code;
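If you prefer something closer to the question's "exclude, then combine" phrasing, the same result can be had with UNION ALL and NOT EXISTS (a sketch, matching on Code like the query above):
SELECT mca.DatabaseId, mca.Id, mca.Code, mca.Name, mca.Street, mca.City, mca.Zip, mca.MaintDate
FROM MyCompanyAddresses mca
UNION ALL
SELECT ca.DatabaseId, ca.Id, ca.Code, ca.Name, ca.Street, ca.City, ca.Zip, ca.MaintDate
FROM CompanyAddresses ca
WHERE NOT EXISTS (SELECT 1
                  FROM MyCompanyAddresses mca2
                  WHERE mca2.Code = ca.Code);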