Storing duplicate data as a column in Postgres? - postgresql

In a database project, I have a users table with a computed value avg_service_rating, and another table called services holding all the services associated with each user and the ratings for those services. Is there a computationally light way to maintain avg_service_rating without updating it every time an INSERT is done on the services table? Perhaps something like a generated column, but with a function call instead? Any direct advice or links to resources would be greatly appreciated as well!
CREATE TABLE users (
    username VARCHAR PRIMARY KEY,
    avg_service_ratings NUMERIC, -- is it possible to store some function call for this column?
    ...
);
CREATE TABLE service (
    username VARCHAR NOT NULL REFERENCES users (username),
    service_date DATE NOT NULL,
    rating INTEGER,
    PRIMARY KEY (username, service_date)
);

If the values should be consistent, a generated column won't fit the bill, since it is only recomputed if the row itself is modified.
I see two solutions:
1. Have a trigger on the services table that updates the users table whenever a rating is added or modified. That slows down data modifications, but not your queries.
2. Turn users into a view. The original users table would be renamed, and it loses the avg_service_rating column, which is computed on the fly by the view.
To make the illusion perfect, create an INSTEAD OF INSERT OR UPDATE OR DELETE trigger on the view that modifies the underlying table. Then your application does not need to be changed.
With this solution you pay a certain price both on SELECT and on data modifications, but the latter price will be lower, since you don't have to modify two tables (and users might receive fewer modifications than services). An added advantage is that you avoid data duplication.
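The view approach could be sketched roughly as follows. The name users_base, the function and trigger names, and the assumption that username is the only remaining column are all illustrative, not taken from the question. EXECUTE FUNCTION needs PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE).

```sql
-- Rename the real table and drop the stored average (users_base is a made-up name).
ALTER TABLE users RENAME TO users_base;
ALTER TABLE users_base DROP COLUMN avg_service_ratings;

-- The view takes over the old name and computes the average on the fly.
CREATE VIEW users AS
SELECT u.username,
       (SELECT avg(s.rating)
        FROM service AS s
        WHERE s.username = u.username) AS avg_service_ratings
FROM users_base AS u;

-- Redirect writes against the view to the underlying table.
CREATE FUNCTION users_view_modify() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO users_base (username) VALUES (NEW.username);
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE users_base SET username = NEW.username
        WHERE username = OLD.username;
        RETURN NEW;
    ELSE  -- DELETE
        DELETE FROM users_base WHERE username = OLD.username;
        RETURN OLD;
    END IF;
END;
$$;

CREATE TRIGGER users_view_modify
    INSTEAD OF INSERT OR UPDATE OR DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION users_view_modify();
```

Any other columns of the real table would have to be listed in the view and handled in the trigger function as well. The foreign key on service keeps pointing at the renamed table, so it needs no change.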

A generated column would only be useful if the source data is in the same table row.
Otherwise your options are a view (where you could call a function or calculate the value via a subquery), or an AFTER UPDATE OR INSERT trigger on the service table, which updates users.avg_service_ratings. With a trigger, if you get a lot of updates on the service table you'd need to consider possible concurrency issues, but it would mean the figure doesn't need to be calculated every time a row in the users table is accessed.
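A minimal sketch of the trigger variant, using the table and column names from the question's DDL; the function and trigger names are made up. It assumes the default READ COMMITTED isolation level and PostgreSQL 11 or later for EXECUTE FUNCTION. It does not handle DELETE or a change of service.username, which would need similar treatment.

```sql
CREATE FUNCTION refresh_avg_service_rating() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Lock the user's row first so concurrent writers for the same user queue up.
    -- Under READ COMMITTED the following statement takes a fresh snapshot and
    -- therefore sees ratings committed by the transactions it waited for.
    PERFORM 1 FROM users WHERE username = NEW.username FOR NO KEY UPDATE;

    UPDATE users
    SET avg_service_ratings = (SELECT avg(rating)
                               FROM service
                               WHERE username = NEW.username)
    WHERE username = NEW.username;

    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$;

CREATE TRIGGER service_refresh_avg
    AFTER INSERT OR UPDATE ON service
    FOR EACH ROW EXECUTE FUNCTION refresh_avg_service_rating();
```

Taking the row lock in a separate statement is one way to address the concurrency concern mentioned above: without it, two transactions rating the same user at the same time could each compute an average that misses the other's row.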

Related

PostgreSQL - on conflict update for GENERATED ALWAYS AS IDENTITY

I have a table of values which I want to manage with my application ...
let's say this is the table
CREATE TABLE student (
    id_student int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
    id_teacher int2,
    student_name varchar(255),
    age int2,
    CONSTRAINT provider_pk PRIMARY KEY (id_student)
);
In the application, each teacher can see the list of all his/her students .. and they can edit or add new students
I am trying to figure out how to UPSERT data into this table in PostgreSQL. Right now, each teacher edits their list on the front end, in JS only, without having to save each change individually. When they are done, they click the SAVE button, and that is when I need to store the changes and new records in the DB.
What I do now is delete all records for that particular teacher and store the new object/array they created (by editing, adding, whatever). It's easy, and I don't have to check for changes and new records. The drawbacks are a brutal waste of the sequence for ID_STUDENT (autogenerated on the DB side) and, of course, a huge overhead on indexes while inserting (rebuilding), considering there will be a lot of teachers saving a lot of their students. That might cause some perf issues, not to mention the fragmentation (HWM), so I would have to VACUUM this table regularly.
In Oracle, I could easily use MERGE INTO (which is fantastic for this use case), but MERGE is not in PostgreSQL :(
The only thing I know about is INSERT ... ON CONFLICT DO UPDATE, but how am I supposed to apply this to a GENERATED ALWAYS AS IDENTITY key? I do not provide this value myself (on top of that, I don't even know the latest number), and therefore I cannot trigger the ON CONFLICT (id_student).
is there any nice way out of this sh*t ? Or DELETE / INSERT is really the way to go?
You shouldn't be too worried about the data churn – after all, an UPDATE also writes a new version of the row, so it wouldn't be that much different. And the sequence is no problem, because you used bigint for the primary key, right (anything else would have been a mistake)?
If you want to use INSERT ... ON CONFLICT in combination with an auto-generated sequence, you need some way beside the primary key to identify a row, that is, you need a UNIQUE constraint that you can use with ON CONFLICT. If there is no candidate for such a constraint, how can you identify the records for the teacher?
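As an illustration only, suppose a student is uniquely identified by teacher and name. That is an assumption; the question does not say what identifies a student, and the constraint name and sample values below are made up. The upsert could then look like this:

```sql
-- Hypothetical natural key; pick whatever really identifies a student.
ALTER TABLE student
    ADD CONSTRAINT student_teacher_name_uq UNIQUE (id_teacher, student_name);

-- id_student is never mentioned: new rows get a generated value,
-- existing rows are found through the unique constraint and updated in place.
INSERT INTO student (id_teacher, student_name, age)
VALUES (7, 'Alice', 12),
       (7, 'Bob', 13)
ON CONFLICT (id_teacher, student_name)
DO UPDATE SET age = EXCLUDED.age;
```

Rows that the teacher removed in the app would still need a separate DELETE. Note also that the identity sequence is advanced for every row in the INSERT, including those that end up as updates, so gaps in id_student remain.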

Postgres Unique Sequences in one table based on owner/foreign key

I am creating a web application that will store all user information in one database using permissions, roles, and FKs to restrict data access. One of the tables in this application tracks work orders created by each user (i.e. the work order table has an FK to the user table).
I am wanting to ensure that each user has their own uninterrupted sequence of 'work order IDs' that are assigned when the work order is scheduled. That is, if user 1 creates his first work order, it will assign it #1, however, if user 2 creates his fifth work order, it will assign it #5.
The work order table has a UUID primary key, so each record is distinguishable, and the user FK has a not-null constraint.
Based on my research so far, it seems like Postgres Sequences would likely be my best answer. I would need to create a sequence for each user, and incorporate it into a trigger to stamp the work order record with the next appropriate ID. However, this seems like it would be very performance intensive, and creating a new sequence for every user would have its own set of challenges.
A second approach could be to create a second table that tracks each user's latest sequence, query it, increment it, and update both the work order table and the number tracking table. However, in this scenario, I think it would be susceptible to race conditions if two users were to convert records at exactly the same time.
I'm unsure what the best way to solve the problem would be. Is there another way that would provide better performance?
Sequences won't work for you, because they are not transactional by design: if an insert with a generated number fails, that number stays consumed even after a ROLLBACK, which leaves a gap.
You should create a second table:
CREATE TABLE counters (
user_id bigint PRIMARY KEY REFERENCES users ON DELETE CASCADE,
work_order_id bigint NOT NULL DEFAULT 0
);
Then you get the next number with
UPDATE counters
SET work_order_id = work_order_id + 1
WHERE user_id = 42  -- the user creating the work order
RETURNING work_order_id;
That is atomic and safe from race conditions. Just make sure you run that update and the insert in the same database transaction, then they will either both succeed or both fail and be undone.
This will serialize inserts into the work orders table per user, but gap-less sequences are always a performance problem.
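Put together, the counter update and the insert might look like the sketch below. The work_orders table and its columns are assumptions (the question only says it has a UUID primary key and a user FK), 42 stands in for the current user's id, and gen_random_uuid() is built in from PostgreSQL 13 (earlier versions need the pgcrypto extension).

```sql
BEGIN;

WITH next_id AS (
    -- Takes a row lock on this user's counter until COMMIT or ROLLBACK.
    UPDATE counters
    SET work_order_id = work_order_id + 1
    WHERE user_id = 42
    RETURNING work_order_id
)
INSERT INTO work_orders (id, user_id, work_order_no)
SELECT gen_random_uuid(), 42, work_order_id
FROM next_id;

COMMIT;
```

If the insert fails, the whole transaction rolls back and the counter returns to its previous value, so no number is lost. The sketch assumes a counters row already exists for the user; if none does, the CTE returns nothing and no work order is inserted.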

"ON UPDATE" equivalent for Amazon Redshift

I want to create a table that has a column updated_date that is updated to SYSDATE every time any field in that row is updated. How should I do this in Redshift?
You should create the table definition as below, which makes sure that whenever you insert a record, the column is populated with SYSDATE.
create table test(
id integer not null,
update_at timestamp DEFAULT SYSDATE);
What about updating it every time a field is updated?
Remember, Redshift is a DW solution, not a simple database, so updates should be avoided or minimized.
UPDATE = DELETE + INSERT
Ideally, instead of updating a record, you should delete it and insert it again. The DEFAULT then takes care of populating update_at, since the update is effectively a DELETE + INSERT.
Also, most of us use ETLs. You may be using a stg_sales table to populate your data; the above solution works there too, where you could do something like the following.
DELETE from SALES where id in (select Id from stg_sales);
INSERT INTO SALES select id from stg_sales;
Hope this answers your question.
Redshift doesn't support UPSERTs, so you should load your data to a temporary/staging table first and check for IDs in the main tables, which also exist in the staging table (i.e. which need to be updated).
Delete those records, and INSERT the data from the staging table, which will have the new updated_date.
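A rough sketch of that staging pattern in Redshift SQL, run as one transaction so readers never see the rows missing. The amount column is made up for illustration, and both tables are assumed to share the id key.

```sql
BEGIN;

-- Remove the rows that are about to be replaced.
DELETE FROM sales
USING stg_sales
WHERE sales.id = stg_sales.id;

-- Re-insert them (plus any brand-new rows) with a fresh timestamp.
INSERT INTO sales (id, amount, updated_date)
SELECT id, amount, SYSDATE
FROM stg_sales;

COMMIT;
```

If updated_date has DEFAULT SYSDATE, it can simply be left out of the INSERT column list instead.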
Also, don't forget to run VACUUM on your tables every once in a while, because your use case involves a lot of DELETEs and UPDATEs.

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, but I need it to work the same as it does in Postgres. There is a table that generates values from a sequence, but the values can also be explicitly given.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Look at the sequence call in the hist_id column to define the value of the primary key)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to a GENERATED ALWAYS it will throw errors because there are some provided values. On the other side, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error when trying to insert with the same value (SQL0803N), because the "internal sequence" does not take into account the already inserted values, and it does not retry with a next value.
And, I do not want to restart the sequence each time a provided ID was inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as it does in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids will never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (eg, JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? It's not efficient - for one thing, attempting to do so would do one of the following:
- Be unsafe in the face of multiple clients (attempt to add duplicate keys)
- Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best to not have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this, SEQUENCE. This provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger;
CREATE TRIGGER Add_Generated_Id NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as @mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE constraint on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.
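For completeness, a sketch of how the pieces might fit together on the DB2 side. It is untested, and the sequence options and the literal id are illustrative only.

```sql
-- The sequence that every id in the table is drawn from.
CREATE SEQUENCE hist_id_seq AS INTEGER START WITH 1 INCREMENT BY 1;

-- Case 1: the application needs the id up front, so it fetches one first...
VALUES NEXT VALUE FOR hist_id_seq;

-- ...and passes the fetched value back in (1 stands in for that value here).
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES (1, 'a');

-- Case 2: the id is not needed beforehand, so the sequence is used inline.
INSERT INTO benchmarksql.history (hist_id, h_data)
VALUES (NEXT VALUE FOR hist_id_seq, 'b');
```

Either way, the value originates from hist_id_seq, so explicitly supplied and generated ids cannot collide.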

How to maintain record history on table with one-to-many relationships?

I have a "services" table for detailing services that we provide. Among the data that needs recording are several small one-to-many relationships (all with a foreign key constraint to the service_id) such as:
service_owners -- user_ids responsible for delivery of service
service_tags -- e.g. IT, Records Management, Finance
customer_categories -- ENUM value
provider_categories -- ENUM value
software_used -- self-explanatory
The problem I have is that I want to keep a history of updates to a service, for which I'm using an update trigger on the table, that performs an insert into a history table matching the original columns. However, if a normalized approach to the above data is used, with separate tables and foreign keys for each one-to-many relationship, any update on these tables will not be recognised in the history of the service.
Does anyone have any suggestions? It seems like I need to store child keys in the service table to maintain the integrity of the service history. Is a delimited text field a valid approach here or, as I am using postgreSQL, perhaps arrays are also a valid option? These feel somewhat dirty though!
Thanks.
If your table is:
create table T (
ix int identity primary key,
val nvarchar(50)
)
And your history table is:
create table THistory (
ix int identity primary key,
val nvarchar(50),
updateType char(1), -- C=Create, U=Update or D=Delete
updateTime datetime,
updateUsername sysname
)
Then you just need to put an update trigger on all tables of interest. You can then find out what the state of any or all of the tables was at any point in history, to determine what the relationships were at that time.
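Since the question is about PostgreSQL, here is how the same idea might look there for one of the child tables. The column layout of service_tags and all function and trigger names are assumptions, and EXECUTE FUNCTION needs PostgreSQL 11 or later.

```sql
-- Assumed shape of one child table from the question.
CREATE TABLE service_tags (
    service_id integer NOT NULL,
    tag        text    NOT NULL,
    PRIMARY KEY (service_id, tag)
);

-- History table mirroring the columns plus audit fields.
CREATE TABLE service_tags_history (
    service_id      integer     NOT NULL,
    tag             text        NOT NULL,
    update_type     char(1)     NOT NULL,  -- C=Create, U=Update or D=Delete
    update_time     timestamptz NOT NULL DEFAULT now(),
    update_username text        NOT NULL DEFAULT current_user
);

CREATE FUNCTION log_service_tags() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (OLD.service_id, OLD.tag, 'D');
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (NEW.service_id, NEW.tag, 'U');
    ELSE  -- INSERT
        INSERT INTO service_tags_history (service_id, tag, update_type)
        VALUES (NEW.service_id, NEW.tag, 'C');
    END IF;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$;

CREATE TRIGGER service_tags_audit
    AFTER INSERT OR UPDATE OR DELETE ON service_tags
    FOR EACH ROW EXECUTE FUNCTION log_service_tags();
```

The tags a service had at a given moment can then be reconstructed by replaying service_tags_history up to that update_time.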
I'd avoid using arrays in any database whenever possible.
I don't like updates for the exact reason you give here: you lose information as it's overwritten. My answer is quite simple: don't update. I'm not sure if you're at a point where this can be implemented, but if you can, I'd recommend using the main table itself to store the history (no need for a second set of history tables).
Add a column to your main header table called 'active'. This can be a character or a bit (0 is off and 1 is on). Then it's a bit of trigger magic: when an update is performed, you insert a row into the table identical to the record being overwritten, with a status of '0' (inactive), and then update the existing row. This process keeps the ID column on the active record the same; the newly inserted record is the inactive one, with a new ID.
This way no data is ever lost (admittedly you are storing quite a few rows...) and the history can easily be viewed with a select where active = 0.
The pain here is if you are working on something already implemented: every existing query that hits this table will need to be updated to include a check for the active column. That makes this solution very easy to implement if you are designing a new system, but a pain if it's a long-standing application. Unfortunately, existing reports will include both off and on records (without throwing an error) until you can modify the WHERE clause.
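In PostgreSQL, the "active" flag idea could be sketched as below. The table layout and all names are assumptions for illustration, and EXECUTE FUNCTION needs PostgreSQL 11 or later.

```sql
-- Assumed minimal shape of the main table.
CREATE TABLE services (
    id     bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name   text    NOT NULL,
    active boolean NOT NULL DEFAULT true
);

CREATE FUNCTION keep_inactive_copy() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Save the values that are about to be overwritten as a new, inactive row.
    INSERT INTO services (name, active) VALUES (OLD.name, false);
    RETURN NEW;  -- let the update of the active row proceed unchanged
END;
$$;

CREATE TRIGGER services_keep_history
    BEFORE UPDATE ON services
    FOR EACH ROW
    WHEN (OLD.active)  -- do not copy when an inactive history row is edited
    EXECUTE FUNCTION keep_inactive_copy();
```

As written, the inactive copy gets a fresh id and nothing ties it back to the active row. In practice one would probably add a column for that (and a timestamp), which goes beyond what the answer describes.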