Data Warehouse - unique constraint on dimension tables, is such design possible? - postgresql

The database used is postgresql.
Suppose I have a dimension table dim_orders with fields describing an order line (one order number can have many order lines, each with a different item name):
order_id (auto-increment primary key)
order_number
order_status (NEW, PAID, ORDERED, ...)
item_name
...
Then I have an ETL process that runs hourly. The data source is the sales database. The problem is that we have an order_line_status field, which can change from hour to hour on the data source (e.g. one cycle is NEW > PAID > ORDERED > DELIVERED > CLOSED), and this status can differ per order line. For example, when order X has line X1 (item: chocolate) and line X2 (item: coffee), X1 chocolate might already be DELIVERED while X2 coffee is still ORDERED.
My fact_sales table is something like this:
sales_id (auto-increment primary key)
order_id (foreign key to dim_orders), which is basically the order line: chocolate or coffee in the sample above
quantity (taken from sales data)
discount_amount (taken from sales data)
...
To maintain speed, I'd like to avoid a network / SQL call every time the ETL process runs, because the network is sometimes saturated and quite slow.
Right now, for every sales record the ETL processes, I query dim_orders:
**existing_order_id = "SELECT order_id FROM dim_orders WHERE order_number = [staging_data.order_number] AND item_name = [staging_data.item_name]"**
if existing_order_id is found then:
- update dim_orders, setting order_status to the new order status from the staging data (it might or might not have changed since the last hourly ETL run)
- update fact_sales, using data from the staging table
else:
- insert into dim_orders and get the new order_id, something like (INSERT INTO dim_orders ... RETURNING order_id)
- insert into fact_sales, using the new order_id
The problem is the bold statement, where I always query for the existing order ID. If I have 10k rows, this means 10k selects before processing the data. I'm trying to change dim_orders to use a unique key on (order_number, item_name), so that instead of select-then-insert/update I can do something like this (I think):
A. upsert data from the staging table into dim_orders. If an existing (order_number, item_name) matches, update the order_status.
B. process the fact table, using the returned order id.
Since we can get back the order id needed by the fact table, query A can be achieved by:
INSERT INTO dim_orders (order_number, item_name, order_status, ...) VALUES (...)
ON CONFLICT(order_number, item_name) DO UPDATE
SET order_status = excluded.order_status
RETURNING order_id
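If the constraint is allowed, I think the upsert and the fact insert could even be combined into a single statement with a data-modifying CTE, roughly like this (the staging table and column names are simplified guesses, and it assumes the staging data has at most one row per (order_number, item_name)):
WITH upserted AS (
    INSERT INTO dim_orders (order_number, item_name, order_status)
    SELECT order_number, item_name, order_status
    FROM staging_sales                          -- staging table name assumed
    ON CONFLICT (order_number, item_name) DO UPDATE
        SET order_status = EXCLUDED.order_status
    RETURNING order_id, order_number, item_name
)
INSERT INTO fact_sales (order_id, quantity, discount_amount)
SELECT u.order_id, s.quantity, s.discount_amount
FROM staging_sales s
JOIN upserted u
  ON u.order_number = s.order_number
 AND u.item_name    = s.item_name;
-- updating fact_sales rows that already exist would still be handled separately
-- (or fact_sales could get its own ON CONFLICT clause if it has a suitable key)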
Now, the only problem is that to do this I have to add a unique constraint on dim_orders on (order_number, item_name).
However, this design was rejected, because we have never had a unique constraint on dimension tables before. As far as I can tell, the only reason is that we have never done it before.
So my questions are:
Is it OK to add a unique constraint on star schema tables (fact / dim)? Or is it actually bad design, and why?
In terms of data warehousing, is there any other approach to this kind of select-then-insert/update problem?
Thanks

Related

Oracle - stop select from returning duplicated values

I am trying to insert data into my table from another table which has two columns, (employee number) and (branch).
Whenever new data is inserted, the last employee number is increased.
My code works fine, but if more than one employee is inserted at the same time they end up with duplicated values.
For example, if I insert data with branch number 100 the employee gets number 101, and if the branch number is 200 the employee gets number 201, etc.
But if data is inserted for two employees that both have the same branch, for example 200, both of them get number 201, whereas I want the first one to get 201 and the second one 202.
I hope you get what I mean; any help will be appreciated.
Here is my code:
insert into emp_table_1 (
  Emp_Name_1,
  Emp_Branch_1,
  Emp_number_1
)
Select Emp_Name_2,
  Emp_Branch_2,
  Case emp_branch
    When '100' Then (Select Max(Emp_number_1)+1 From emp_table_1 Where Branch_Cd=100)
    When '200' Then (Select Max(Emp_number_1)+1 From emp_table_1 Where Branch_Cd=200)
  End As Emp_number_2
From emp_table_2
Don't try to have sequential numbers for each branch and don't try to use MAX to find the next number in the sequence.
Use a sequence (that is what they are designed for).
CREATE SEQUENCE employee_id__seq;
Then you can use:
insert into emp_table_1 (Emp_Name_1, Emp_Branch_1, Emp_number_1)
Select Emp_Name_2 ,
Emp_Branch_2,
employee_id__seq.NEXTVAL
From emp_table_2
Then each employee will have a unique number (which you can use as a primary key) and you will not get concurrency issues if multiple people try to create new users at the same time.
Or, from Oracle 12, you could use an identity column in your table:
CREATE TABLE emp_table_1(
emp_name_1 VARCHAR2(200),
emp_branch_1 NUMBER CONSTRAINT emp_table_1__branch__fk REFERENCES branch_table (branch_id), -- datatype assumed; Oracle requires one here
emp_number_1 NUMBER(8,0)
GENERATED ALWAYS AS IDENTITY
CONSTRAINT emp_table_1__number__pk PRIMARY KEY
);
Then your query is simply:
insert into emp_table_1 (Emp_Name_1, Emp_Branch_1)
Select Emp_Name_2 ,
Emp_Branch_2
From emp_table_2
And the identity column will be auto-generated.

Is it OK to store the transactional primary key on a data warehouse dimension table to relate fact and dim?

I have a data source (a postgres transactional system) like this (simplified; the actual tables have more fields than this):
Then I need to create an ETL pipeline, where the required report is something like this :
order number (from sales_order_header)
item name (from sales_order_lines)
batch shift start & end (from receiving_batches)
delivered quantity, approved received quantity, rejected received quantity (from receiving_inventories)
My design for fact-dim tables is this (simplified).
What I don't know is the optimal ETL design.
Let's focus on how to insert the fact rows, and on the relationship between the fact and dim_sales_orders.
If I have staging tables like these:
The ETL runs daily. After 22:00, there will be no more receiving, so I can run the ETL at 23:00.
Then I can just fetch data from sales_order_header and sales_order_lines, so at 23:00 the script can run something like:
INSERT
INTO
staging_sales_orders (
SELECT
order_number,
item_name
FROM
sales_order_header soh,
sales_order_lines sol
WHERE
soh.sales_order_id = sol.sales_order_header_id
and date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
);
And the fact table load can run at 23:30, with this query:
SELECT
soh.order_number,
rb.batch_shift_start,
rb.batch_shift_end,
sol.item_name,
ri.delivered_quantity,
ri.approved_received_quantity,
ri.rejected_received_quantity
FROM
receiving_batches rb,
receiving_inventories ri,
sales_order_lines sol,
sales_order_header soh
WHERE
rb.batch_id = ri.batch_id
AND ri.sales_order_line_id = sol.sales_order_line_id
AND sol.sales_order_header_id = soh.sales_order_id
AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
But how do I optimally load the data into the fact table?
My approach:
1. Select from staging_sales_orders and insert the rows into dim_sales_orders, using the auto-increment primary key (steps 1 and 3 are sketched after this list).
2. Before inserting into fact_receiving_inventories, I need to know the dim_sales_order_id, so I select:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
order_number = staging_row.order_number
AND item_name = staging_row.item_name
3. Then insert into the fact table.
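For completeness, steps 1 and 3 would look roughly like this, done set-based rather than row by row (the receiving staging table and its column names are assumed):
-- step 1: load the dimension from staging (dim_sales_order_id is auto-increment)
INSERT INTO dim_sales_orders (order_number, item_name)
SELECT DISTINCT order_number, item_name
FROM staging_sales_orders;

-- step 3: load the fact, resolving dim_sales_order_id per staging row
INSERT INTO fact_receiving_inventories
    (dim_sales_order_id, delivered_quantity,
     approved_received_quantity, rejected_received_quantity)
SELECT dso.dim_sales_order_id,
       s.delivered_quantity,
       s.approved_received_quantity,
       s.rejected_received_quantity
FROM staging_receiving s                 -- staging table name assumed
JOIN dim_sales_orders dso
  ON dso.order_number = s.order_number
 AND dso.item_name    = s.item_name;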
Now, what I'm unsure about is point 2 (selecting from the existing dim). Here I select based on 2 varchar columns, which should be a performance hit. Since the source is in normalized form, I'm thinking of modifying the staging tables, adding sales_order_line_id to both staging tables. Then, at point 2 above, I can just do:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
sales_order_line_id = staging_row.sales_order_line_id
But as a consequence, I will need to add sales_order_line_id to dim_sales_orders, which I don't find common in tutorials. I mean, adding the transactional table's PK can technically be done since I can access the data source. But is it good DW fact-dim design to add such a transactional field (especially since it is a PK)?
Or is there any other approach, rather than selecting the existing dim row based on 2 varchars?
How do I optimally select dimension ids for fact tables?
Thanks
It is practically mandatory to include the source PK/BK (business key) in a dimension.
The standard process is to load your dims and then load your facts. For the fact loads, you translate the source data to the appropriate dim SKs (surrogate keys) with lookups to the dims using the PK/BK.
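As a rough illustration of that lookup, reusing the tables from the question and assuming the source PK (sales_order_line_id) has been stored on the dimension and is present in the receiving staging table:
-- fact load: translate the source key to the dimension surrogate key
INSERT INTO fact_receiving_inventories
    (dim_sales_order_id, delivered_quantity,
     approved_received_quantity, rejected_received_quantity)
SELECT dso.dim_sales_order_id,
       s.delivered_quantity,
       s.approved_received_quantity,
       s.rejected_received_quantity
FROM staging_receiving s                 -- staging table name assumed
JOIN dim_sales_orders dso
  ON dso.sales_order_line_id = s.sales_order_line_id;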

Prevent two threads from selecting the same row (IBM Db2)

I have a situation where I have multiple (potentially hundreds of) threads repeating the same task (using a Java scheduled executor, if you are curious). The task entails selecting rows of changes (from a table called change) that have not yet been processed (processed changes are tracked in an m:n join table called change_process_rel that stores the process id, record id and status), processing them, then updating the status back.
My question is: what is the best way to prevent two threads from the same process from selecting the same row? Will the solution below (using FOR UPDATE to lock rows) work? If not, please suggest a working solution.
Create table change (
  -- id, autogenerated pk
  -- other fields
)
Create table change_process_rel (
  -- change id (pk of change table)
  -- process id (pk of process table)
  -- status
)
The query I would use is listed below:
Select * from
change c
where c.id not in(select changeid from change_process_rel with cs) for update
Please let me know if this would work
You have to "lock" a row which you are going to process somehow. Such a "locking" should be concurrent of course with minimum conflicts / errors.
One way is as follows:
Create table change
(
id int not null generated always as identity
, v varchar(10)
) in userspace1;
insert into change (v) values '1', '2', '3';
Create table change_process_rel
(
id int not null
, pid int not null
, status int not null
) in userspace1;
create unique index change_process_rel1 on change_process_rel(id);
Now you should be able to run the same statement from multiple concurrent sessions:
SELECT ID
FROM NEW TABLE
(
insert into change_process_rel (id, pid, status)
select c.id, mon_get_application_handle(), 1
from change c
where not exists (select 1 from change_process_rel r where r.id = c.id)
fetch first 1 row only
with ur
);
Every such statement inserts 1 or 0 rows into the change_process_rel table, which is used here as a "lock" table. The corresponding ID from change is returned, and you may proceed with processing the corresponding event in the same transaction.
If the transaction completes successfully, the row inserted into change_process_rel is kept, so the corresponding id from change may be considered processed. If the transaction fails, the corresponding "lock" row in change_process_rel disappears, and that row may be processed later by this or another application.
The problem with this method is that when both tables become large enough, the sub-select may not be as quick as before.
Another method is to use Evaluate uncommitted data through lock deferral.
It requires placing the status column in the change table.
Unfortunately, Db2 for LUW doesn't have SKIP LOCKED functionality, which might help with this sort of algorithm.
If, let's say, status=0 means "not processed" and status<>0 is some processing / processed status, then after setting the DB2_EVALUNCOMMITTED and DB2_SKIP* registry variables and restarting the instance, you can "catch" the next ID for processing with the following statement.
SELECT ID
FROM NEW TABLE
(
update
(
select id, status
from change
where status=0
fetch first 1 row only
)
set status=1
);
Once you get it, you may do further processing of this ID in the same transaction as previously.
It's good to create an index for performance:
create index change1 on change(status);
and maybe mark this table as volatile, or periodically collect distribution statistics on this column in addition to regular statistics on the table and its indexes.
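A rough sketch of those two suggestions (the schema name in RUNSTATS is assumed):
-- mark the table as volatile so the optimizer prefers index access
ALTER TABLE change VOLATILE CARDINALITY;

-- collect distribution statistics plus regular table / index statistics
CALL SYSPROC.ADMIN_CMD(
  'RUNSTATS ON TABLE MYSCHEMA.CHANGE WITH DISTRIBUTION AND INDEXES ALL'
);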
Note that such a registry variable setting has a global effect, and you should keep that in mind...

I found inconsistent data on postgres

I have a table of data on postgres. This table, call it 'table1', has a unique constraint on the field 'id'. The table also has 3 other fields: 'write_date', 'state', 'state_detail'.
All this time, I have had no problem accessing and joining this table with other tables using the field 'id' as the join key. But this time, I got a strange result when querying table1.
When I run this query (call it Query1):
SELECT id, write_date, state, state_detail
FROM table1
WHERE write_date = '2019-07-30 19:42:49.314' or write_date = '2019-07-30 14:29:06.945'
it gives me 2 rows, with the same id but different values for the other fields:
id || write_date || state || state_detail
168972 2019-07-30 14:29:06.945 1 80
168972 2019-07-30 19:42:49.314 2 120
BUT, when I run this query (call it Query2):
SELECT id, write_date, state, state_detail
FROM table1
WHERE id = 168972
it gives me just 1 row:
id || write_date || state || state_detail
168972 2019-07-30 19:42:49.314 2 120
How come it gives different results? I mean, I checked table1, and it has the unique constraint on 'id' as the primary key. How can this happen?
I have restarted the postgres service and run those 2 queries again, and they still give me the same results as above.
This looks like a case of index corruption, specifically on the unique index on the id column. Could you run the following query:
SELECT ctid, id, write_date, state, state_detail FROM table1
WHERE write_date = '2019-07-30 19:42:49.314' or write_date = '2019-07-30 14:29:06.945'
You will likely receive 2 rows back for the id, with two different ctids. The ctid represents the physical location on disk for each row. Presuming you get two rows back, you will need to pick a row and delete the other one in order to "fix" the data. In addition, you'll want to recreate the unique index on the table, in order to prevent this from happening again.
Oh, and don't be surprised if you find other rows in the table like this. Depending on the source of the corruption (bad disks, bad memory, recent crash, upgrade to glibc?), this could be the sign of future trouble to come. I'd probably double-check all my tables for any issues, recreate any unique indexes, and look at the OS level for any signs of disk corruption or I/O errors. And if you aren't on the latest version of Postgres, you should upgrade.
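For example, the cleanup could look roughly like this; the ctid value here is a placeholder you would take from your own output, and the index name is assumed:
-- delete the stale duplicate by its physical location (placeholder ctid)
DELETE FROM table1 WHERE id = 168972 AND ctid = '(1234,5)';

-- rebuild the corrupted unique index (index name assumed)
REINDEX INDEX table1_pkey;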

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. And afterwards when I add more data, the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row number, in order to access my table arbitrarily.
Is there a way in sqlite to do this? Or do I have to manually manage removing and adding data?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you are asking for trouble in the future (e.g. if you create another table and want to have relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
Sqlite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
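As a minimal sketch of those two options:
-- VACUUM rebuilds the database file and reclaims free pages; for tables
-- without an explicit INTEGER PRIMARY KEY it may also renumber rowids
VACUUM;

-- alternatively, let auto-vacuum keep the file compact; switching it on
-- for an existing database needs a VACUUM afterwards to take effect
PRAGMA auto_vacuum = FULL;
VACUUM;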