Strategy for selecting records for a DataSet - ADO.NET

In the most common case, we have two (or more) tables in the DB, termed master (e.g. SalesOrderHeader) and child (e.g. SalesOrderDetail).
We can read records from the DB with one SELECT using an INNER JOIN and an additional WHERE constraint to reduce the volume of data loaded from the DB (using "Adapter.Fill(DataSet)"):
#"SELECT d.SalesOrderID, d.SalesOrderDetailID, d.OrderQty,
d.ProductID, d.UnitPrice
FROM Sales.SalesOrderDetail d
INNER JOIN Sales.SalesOrderHeader h
ON d.SalesOrderID = h.SalesOrderID
WHERE DATEPART(YEAR, OrderDate) = #year;"
Did I understand correctly that in this case we receive one table in the DataSet, without primary and foreign keys, and without the possibility of setting a constraint between master and child tables?
Is this DataSet useful only for queries over the columns and records that already exist in the DataSet?
We can't use DbCommandBuilder to create the Insert, Update and Delete SqlCommands based on the SelectCommand that was used to fill the DataSet, and then simply update the data in these tables in the DB?
If we want to modify data locally in these tables using the disconnected layer of ADO.NET, we must populate the DataSet with two SELECTs:
"SELECT *
FROM Sales.SalesOrderHeader;"
"SELECT *
FROM Sales.SalesOrderDetail;"
After that we must create the primary keys for both tables, set the constraint between master and child table, and create the Insert, Update and Delete SqlCommands with DbCommandBuilder.
In that case we will be able to make any modification to the data in these tables locally and afterwards update the records in the DB (using "Adapter.Update(DataSet)").
If we use one SelectCommand to load data into two tables in the DataSet, can we use that SelectCommand with DbCommandBuilder to create the other SqlCommands for Update, and update all tables in the DataSet with one "Adapter.Update(DataSet)", or must we create a separate adapter to update every table?
If, to save resources, I load only part of the records (see below) from a table (e.g. SalesOrderDetail), do I understand correctly that I may have problems when I send new records to the DB (via Update), because the new records can conflict by primary key with records that already exist in the DB (records that have a different value in the OrderDate field)?
"SELECT *
FROM Sales.SalesOrderDetail
WHERE DATEPART(YEAR, OrderDate) = @year;"

There is nothing preventing you from writing your own Insert, Update and Delete commands for your first select statement with the join. Of course you will have to determine a way to ensure that the foreign keys exist.
Insert Into SalesOrderDetail (SalesOrderID, OrderQty, ProductID, UnitPrice) Values (@SalesOrderID, @OrderQty, @ProductID, @UnitPrice);
Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;
Delete From SalesOrderDetail Where SalesOrderDetailID = @ID;
You would execute these with ADO.NET commands instead of using the adapter. I wrote the sample code in VB.NET but I am sure it is easy to change to C# if you prefer.
Private Sub UpdateQuantity(Quant As Integer, DetailID As Integer)
    Using cn As New SqlConnection("Your connection string"),
          cmd As New SqlCommand("Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;", cn)
        cmd.Parameters.Add("@OrderQty", SqlDbType.Int).Value = Quant
        cmd.Parameters.Add("@ID", SqlDbType.Int).Value = DetailID
        cn.Open()
        cmd.ExecuteNonQuery()
    End Using
End Sub
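For the two-SELECT approach described in the question (one adapter per table, primary keys, a relation between master and child, and DbCommandBuilder-generated commands), a rough VB.NET sketch might look like the following. It assumes SQL Server and the AdventureWorks tables from the question, and pushes changes parent-first (deletes would need the child table updated first):
Private Sub LoadAndSave()
    ' Requires Imports System.Data and Imports System.Data.SqlClient
    Using cn As New SqlConnection("Your connection string")
        Dim ds As New DataSet()
        Dim headerAdapter As New SqlDataAdapter("SELECT * FROM Sales.SalesOrderHeader", cn)
        Dim detailAdapter As New SqlDataAdapter("SELECT * FROM Sales.SalesOrderDetail", cn)
        ' AddWithKey copies the primary key definitions from the DB into the DataSet
        headerAdapter.MissingSchemaAction = MissingSchemaAction.AddWithKey
        detailAdapter.MissingSchemaAction = MissingSchemaAction.AddWithKey
        headerAdapter.Fill(ds, "SalesOrderHeader")
        detailAdapter.Fill(ds, "SalesOrderDetail")
        ' Constraint between master and child table
        ds.Relations.Add("Header_Detail",
                         ds.Tables("SalesOrderHeader").Columns("SalesOrderID"),
                         ds.Tables("SalesOrderDetail").Columns("SalesOrderID"))
        ' One command builder per adapter generates the Insert/Update/Delete commands
        Dim headerBuilder As New SqlCommandBuilder(headerAdapter)
        Dim detailBuilder As New SqlCommandBuilder(detailAdapter)
        ' ... modify rows in ds locally here ...
        ' Each adapter updates only its own table, parent first for inserts
        headerAdapter.Update(ds, "SalesOrderHeader")
        detailAdapter.Update(ds, "SalesOrderDetail")
    End Using
End Sub
Note that DbCommandBuilder only generates commands from a single-table SelectCommand, so the joined SELECT from the first query cannot drive it; each table needs its own adapter and its own Update call.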

Generating a materialized join table with a many-to-one column in ksqldb

Thanks in advance for any input!
I have the requirement of retrieving data from the 4 database tables below via 1 HTTP request. I've chosen to create a materialized table with ksqlDB which will contain all relevant data from the 4 database tables. My API Gateway will then query that table using ksqlDB's REST API.
My struggle is in creating 1 ksqlDB table to show information for a purchase order webpage, which consists of data from all 4 of the services shown below:
vendor_tbl
contract_tbl (has vendorId column referencing vendor_table's PK)
services_tbl (has contractId column referencing contract_table's PK)
purchase_order_tbl (has both vendorId && contractId columns referencing vendor/contract table's PKs)
The issue lies mainly with the services table, because it has a many-to-one relationship with contracts. 1 service is for 1 contract, but 1 contract can have many services, which is common.
The structure as-is works perfectly in the RDBMS context, but I'm struggling to find any way with ksqlDB to create 1 materialized table containing all of the service IDs for the PO, along with data from the corresponding contract, vendor and PO tables...
What I have tried:
1.
I have tried creating 2 streams, each joining 2 of the 4 sources, and then "daisy-chaining" the 2 joined streams, as such:
CREATE STREAM po_vendor_join AS SELECT * FROM po_stream p INNER JOIN vendors_tbl v ON p.vendorId = v.id;
CREATE STREAM service_contract_join AS SELECT * FROM services_stream s INNER JOIN contracts_tbl c ON s.contractId = c.id;
This works, but of course there are duplicate entries in the service_contract_join stream. The next step anyway is to create a table from these 2 streams, but I cannot do this because of the aggregation/GROUP BY requirements, which require that every column in the table be part of the aggregation (illustrated in the sketch after this list). I understand that I would have duplicates in OTHER columns (the vendor PK and contract PK are referenced in multiple tables, for instance), but the PK itself is UNIQUE and needs to be the id of services_tbl in this case, since there are multiple services for a PO. (Note that I have also tried LEFT and RIGHT OUTER joins on the streams.)
2.
I have tried "staggering" between table/stream, as such:
CREATE TABLE vendors_tbl (id VARCHAR PRIMARY KEY, name VARCHAR) WITH (KAFKA_TOPIC='vendors', VALUE_FORMAT='json', PARTITIONS=1);
CREATE STREAM contracts_stream (id VARCHAR, vendorId VARCHAR) WITH (KAFKA_TOPIC='contracts', VALUE_FORMAT='json', PARTITIONS=1);
And then rendering a joined stream between the table/stream like so:
CREATE STREAM po_vendor_join
AS SELECT p.*, v.*
FROM po_stream p
INNER JOIN vendors_tbl v ON p.vendorId = v.id;
But then when trying to join the stream, of course I am met with the same restrictions as in attempt # 1.
3.
I have tried making tables for all 4 of the services, simply modeling them as non-queryable tables.
The problem with this approach arises when I try to create a join table between 2 of the tables, like below:
CREATE TABLE PO_HOLLISTIC_AGGREGATE
AS
SELECT * FROM services_contracts_join_tbl scj
INNER JOIN po_vendors_join_tbl pvj ON scj.c_vendorId = pvj.v_id;
I receive an error here stating that the join needs to be on the PK of the right table, but again, the PK required here would have to be the services table's, because there are multiple services per PO.
This makes me conclude that the only way this would be possible with ksqlDB is if I stored the service PK on the other 3 streams, which wouldn't really be doable either, because of the aggregation restrictions when creating a table from joined streams.
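To illustrate the restriction from attempt 1: a table created on top of a joined stream has to wrap every non-key column in an aggregate, so the statement would have to look something like this (the column aliases here are hypothetical):
CREATE TABLE po_services_tbl AS
SELECT s_id,
       LATEST_BY_OFFSET(c_id) AS contract_id,
       LATEST_BY_OFFSET(v_id) AS vendor_id
FROM service_contract_join
GROUP BY s_id;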
I'd appreciate any ideas, thanks again!

Is it OK to store transactional primary key on data warehouse dimension table to relate between fact-dim?

I have a data source (a Postgres transactional system) like this (simplified; the actual tables have more fields than this):
Then I need to create an ETL pipeline, where the required report is something like this:
order number (from sales_order_header)
item name (from sales_order_lines)
batch shift start & end (from receiving_batches)
delivered quantity, approved received quantity, rejected received quantity (from receiving_inventories)
My design for fact-dim tables is this (simplified).
What I don't know is the optimal ETL design.
Let's focus on how to insert the fact, and the relationship between the fact and dim_sales_orders.
If I have staging tables like these:
The ETL runs daily. After 22:00, there will be no more receiving, so I can run the ETL at 23:00.
Then I can just fetch data from sales_order_header and sales_order_lines, so at 23:00 the script can run something like:
INSERT
INTO
staging_sales_orders (
SELECT
order_number,
item_name
FROM
sales_order_header soh,
sales_order_lines sol
WHERE
soh.sales_order_id = sol.sales_order_header_id
and date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
);
And for the fact table, it can run at 23:30, with this query:
SELECT
soh.order_number,
rb.batch_shift_start,
rb.batch_shift_end,
sol.item_name,
ri.delivered_quantity,
ri.approved_received_quantity,
ri.rejected_received_quantity
FROM
receiving_batches rb,
receiving_inventories ri,
sales_order_lines sol,
sales_order_header soh
WHERE
rb.batch_id = ri.batch_id
AND ri.sales_order_line_id = sol.sales_order_line_id
AND sol.sales_order_header_id = soh.sales_order_id
AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
But how do I optimally load the data, particularly into the fact table?
My approach:
1. Select from staging_sales_orders and insert the rows into dim_sales_orders, using an auto-increment primary key.
2. Before inserting into fact_receiving_inventories, I need to know the dim_sales_order_id. So in that case, I select:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
order_number = staging_row.order_number
AND item_name = staging_row.item_name
then insert into the fact table.
Now what I doubt is point 2 (selecting from the existing dim). Here I select based on 2 varchar columns, which should be a performance hit. Since the normalized source has the key, I'm thinking of modifying the staging tables, adding sales_order_line_id to both of them. Then, for point 2 above, I can just do:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
sales_order_line_id = staging_row.sales_order_line_id
But as a consequence, I will need to add sales_order_line_id to dim_sales_orders, which I don't find to be common in tutorials. I mean, adding a transactional table's PK can technically be done, since I can access the data source. But is it good DW fact-dim design to add such a transactional field (especially since it is a PK)?
Or is there any other approach, rather than selecting the existing dim based on 2 varchars?
How do I optimally select dimension ids for fact tables?
Thanks
It is practically mandatory to include the source PK/BK in a dimension.
The standard process is to load your Dims and then load your facts. For the fact loads you translate the source data to the appropriate Dim SKs with lookups to the Dims using the PK/BK.
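A minimal sketch of that fact load, assuming dim_sales_orders carries the source business key sales_order_line_id (as the question proposes) and using a hypothetical staging_receiving_inventories table for the measures:
INSERT INTO fact_receiving_inventories
    (dim_sales_order_id, delivered_quantity, approved_received_quantity, rejected_received_quantity)
SELECT
    dso.dim_sales_order_id,              -- surrogate key looked up via the business key
    sri.delivered_quantity,
    sri.approved_received_quantity,
    sri.rejected_received_quantity
FROM staging_receiving_inventories sri
JOIN dim_sales_orders dso
    ON dso.sales_order_line_id = sri.sales_order_line_id;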

PostgreSQL: prevent lock on self table update with left join

I'm on PostgreSQL 9.3. I'm the only one working on the database, and my code runs queries sequentially for unit tests.
Most of the time the following UPDATE query runs without problem, but sometimes it gets stuck in a lock on the PostgreSQL server. The query then seems to never end, while it normally takes only 3 seconds.
I must point out that the query runs in a unit test context, i.e. the data is exactly the same whether the lock happens or not. The code is the only process that updates the data.
I know there may be lock problems in PostgreSQL when an UPDATE query joins a table to itself, and even more so when a LEFT JOIN is used.
I also know that a LEFT JOIN query can be replaced with a NOT EXISTS query for an UPDATE, but in my case the LEFT JOIN is much faster because there is little data to update, while a NOT EXISTS would have to visit nearly all candidate rows.
So my question is: what PostgreSQL commands (like an explicit LOCK on the table) or options (like SELECT FOR UPDATE) should I use to ensure that my query runs without a never-ending lock?
Query:
-- for each places of scenario #1 update all owners that
-- are different from scenario #0
UPDATE t_territories AS upt
SET id_owner = diff.id_owner
FROM (
-- list of owners in the source that are different from target
SELECT trg.id_place, src.id_owner
FROM t_territories AS trg
LEFT JOIN t_territories AS src
ON (src.id_scenario = 0)
AND (src.id_place = trg.id_place)
WHERE (trg.id_scenario = 1)
AND (trg.id_owner IS DISTINCT FROM src.id_owner)
-- FOR UPDATE -- bug SQL : FOR UPDATE cannot be applied to the nullable side of an outer join
) AS diff
WHERE (upt.id_scenario = 1)
AND (upt.id_place = diff.id_place)
Table structure:
CREATE TABLE t_territories
(
id_scenario integer NOT NULL,
id_place integer NOT NULL,
id_owner integer,
CONSTRAINT t_territories_pk PRIMARY KEY (id_scenario, id_place),
CONSTRAINT t_territories_fkey_owner FOREIGN KEY (id_owner)
REFERENCES t_owner (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE RESTRICT
)
I think your query was blocked by another query. You can find that query with:
SELECT
COALESCE(blockingl.relation::regclass::text,blockingl.locktype) as locked_item,
now() - blockeda.query_start AS waiting_duration, blockeda.pid AS blocked_pid,
blockeda.query as blocked_query, blockedl.mode as blocked_mode,
blockinga.pid AS blocking_pid, blockinga.query as blocking_query,
blockingl.mode as blocking_mode
FROM pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
( (blockingl.transactionid=blockedl.transactionid) OR
(blockingl.relation=blockedl.relation AND blockingl.locktype=blockedl.locktype)
) AND blockedl.pid != blockingl.pid)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE NOT blockedl.granted
AND blockinga.datname = current_database()
I found this query here: http://big-elephants.com/2013-09/exploring-query-locks-in-postgres/
You can also use an ACCESS EXCLUSIVE lock to prevent any other query from reading or writing table t_territories:
LOCK t_territories IN ACCESS EXCLUSIVE MODE;
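Note that LOCK can only be run inside a transaction block, and the lock is released when that transaction ends, so as a rough sketch:
BEGIN;
LOCK TABLE t_territories IN ACCESS EXCLUSIVE MODE;
-- run the UPDATE ... FROM ... query from the question here
COMMIT;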
More info about locks is here: https://www.postgresql.org/docs/9.1/static/explicit-locking.html

Update or insert with outer join in postgres

Is it possible to add a new column to an existing table from another table, using INSERT or UPDATE in conjunction with a FULL OUTER JOIN?
In my main table I am missing some records in one column; in the other table I have all those records, and I want to take the full record set into the maintable table. Something like this:
UPDATE maintable
SET all_records= othertable.records
FROM
FULL JOIN othertable on maintable.col = othertable.records;
Here maintable.col has the same id as othertable.records.
I know I could simply join the tables, but I have a lot of comments in the maintable that I don't want to have to copy-paste back in if possible. As I understand it, a WHERE-based join is equivalent to an inner join, so it won't show me what I'm missing.
EDIT:
What I want is effectively a new maintable.col with all the records, which I can then pare down based on the presence of records in other columns from other tables.
Try this:
UPDATE maintable
SET all_records = o.records
FROM othertable o
WHERE maintable.col = o.records;
This is the general syntax to use in postgres when updating via a join.
HTH
EDIT
BTW you will need to change this - I used your example, but you are updating the maintable with the column used for the join! Your set needs to be something like SET missingcol = o.extracol
AMENDED GENERALISED ANSWER (following off-line chat)
To take a simplified example, suppose that you have two tables maintable and subtable, each with the same columns, but where the subtable has extra records. For both tables id is the primary key. To fill maintable with the missing records, for pre 9.5 versions of Postgres you must use the following syntax:
INSERT INTO maintable (SELECT * FROM subtable s WHERE NOT EXISTS
(SELECT 1 FROM maintable m WHERE m.id = s.id));
Since 9.5 there is a (preferred) alternative:
INSERT INTO maintable (SELECT * FROM subtable) ON CONFLICT DO NOTHING;
This is preferred because (apart from being simpler) it avoids the situation that has been known to arise in the former, where a race condition is created between the INSERT and the sub-SELECT.
Obviously when the columns are different, you need to specify in the INSERT statement which columns are inserted from which. Something like:
INSERT INTO maintable (id, ColA, ColB)
(SELECT id, ColE, ColG FROM subtable ....)
Similarly the common field might not be id in both tables. However, the simplified example should be enough to point you in the right direction.
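Putting the two pieces together, a sketch of the 9.5+ form with an explicit column list (the column names are illustrative, as above) would be:
INSERT INTO maintable (id, ColA, ColB)
SELECT id, ColE, ColG FROM subtable
ON CONFLICT (id) DO NOTHING;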

Postgresql get references from a dictionary

I'm trying to build a query to get the data from a table, but some of the columns are foreign keys that I would like to replace with the associated name, all in one query.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a query with a subquery + alias to get that, but of course I get more than one result per subquery, and I can't find a way to link that subquery to the ID of table B from the main query [might be exhausted, dumb or both]. I did something like this:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
It feels like such a simple query, yet...
You don't need a join in the subselect.
SELECT pkb_id,
(SELECT name FROM a WHERE a.pka_id = b.fka_id),
amount
FROM b;
(See it live in SQL Fiddle).
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b, a
WHERE a.pka_id = b.fka_id;
Note that the join version puts less restrictions on the PostgreSQL query optimizer so in some cases the join version might work faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU units, cf. Parallel Query).
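One behavioural difference worth noting: for rows of b with no matching row in a, the subselect version returns NULL as the name, while the plain join drops those rows. A LEFT JOIN keeps them, matching the subselect:
SELECT b.pkb_id, a.name, b.amount
FROM b
LEFT JOIN a ON a.pka_id = b.fka_id;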